[voxtral] language detection + skipping lang:xx (#41225)

* proc + doc update

* improve doc

* add lang:xx in decode

* update voxtral test

* nit

* nit

* update test value

* use regex
This commit is contained in:
eustlb
2025-10-10 11:18:30 +02:00
committed by GitHub
parent f4487ec521
commit c5094a4f97
5 changed files with 74 additions and 10 deletions

View File

@ -264,7 +264,7 @@ class VoxtralForConditionalGenerationIntegrationTest(unittest.TestCase):
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
'The audio is a humorous exchange between two individuals, likely friends or acquaintances, about tattoos. Here\'s a breakdown:\n\n1. **Initial Reaction**: One person (let\'s call him A) is surprised to see the other person (let\'s call him B) has a tattoo.\n2. **Curiosity**: A asks B what his tattoo says, and B responds with "sweet."\n3. **Repetition**: This exchange is repeated multiple times, with A asking about B\'s tattoo and B responding with "sweet."\n4. **Clarification**: Eventually, B clarifies that A\'s tattoo says "dude" and A\'s says "sweet."\n5. **Final Insult**: B calls A an "idiot" for not understanding the joke.\n\nThe humor comes from the repetition of the word "sweet" and the confusion about the tattoos\' meanings. The final insult adds a touch of frustration to the exchange.'
'The audio is a humorous exchange between two individuals, likely friends or acquaintances, about tattoos. Here\'s a breakdown:\n\n1. **Initial Reaction**: One person (let\'s call him A) is surprised to see the other person (let\'s call him B) has a tattoo. A asks if B has a tattoo, and B confirms.\n\n2. **Tattoo Description**: B then asks A what his tattoo says, and A responds with "sweet." This exchange is repeated multiple times, with B asking A what his tattoo says, and A always responding with "sweet."\n\n3. **Misunderstanding**: B seems to be genuinely curious about the meaning of the tattoo, but A is either not paying attention or not understanding the question. This leads to a series of repetitive responses from A.\n\n4. **Clarification**: Eventually, B clarifies that he wants to know what A\'s tattoo says, not what A thinks B\'s tattoo says. A then realizes his mistake and apologizes.\n\n5. **Final Answer**: B then asks A what his tattoo says, and A finally responds with "dude," which is the actual meaning of his tattoo.\n\n6. **Final Joke**: B then jokes that A\'s tattoo says "sweet," which is a play on words, as "sweet" can also mean "good" or "nice."\n\nThroughout the conversation, there\'s a lot of repetition and misunderstanding, which adds to the humor. The final joke about the tattoo saying "sweet" is a clever twist on the initial confusion.'
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@ -487,6 +487,7 @@ class VoxtralForConditionalGenerationIntegrationTest(unittest.TestCase):
see https://github.com/huggingface/transformers/pull/39429 PR's descrition.
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
# test without language detection
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, dtype=self.dtype, device_map=torch_device
)
@ -501,6 +502,21 @@ class VoxtralForConditionalGenerationIntegrationTest(unittest.TestCase):
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"lang:enThis week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers, and I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees, to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. Not just when there's an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our"
"This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers, and I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees, to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. Not just when there's an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our"
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
# test with language detection, i.e. language=None
inputs = self.processor.apply_transcription_request(
audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
model_id=self.checkpoint_name,
)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers, and I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees, to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. Not just when there's an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our"
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)

View File

@ -12,7 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import base64
import gc
import io
import tempfile
import unittest
@ -31,6 +33,7 @@ if is_mistral_common_available():
import mistral_common.tokens.tokenizers
from mistral_common.exceptions import InvalidMessageStructureException
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.utils import list_local_hf_repo_files
@ -259,6 +262,31 @@ class TestMistralCommonTokenizer(unittest.TestCase):
):
self.tokenizer.decode(tokens_ids, skip_special_tokens=False, unk_args="")
def test_decode_transcription_mode(self):
# in the specific case of Voxtral, the added f"lang:xx" (always a two char language code since it follows ISO 639-1 alpha-2 format)
# is not considered as a special token by mistral-common and is encoded/ decoded as normal text.
# we made the explicit choice of skipping "lang:xx" it to ease users life, see `[~MistralCommonTokenizer.decode]`
expected_string = "lang:en[TRANSCRIBE]"
openai_transcription_request = {
"model": None,
"language": "en",
"file": io.BytesIO(base64.b64decode(AUDIO_BASE_64)),
}
transcription_request = TranscriptionRequest.from_openai(openai_transcription_request)
tokenized_transcription_request = self.ref_tokenizer_audio.encode_transcription(transcription_request)
# without skip_special_tokens
self.assertEqual(
self.tokenizer_audio.decode(tokenized_transcription_request.tokens, skip_special_tokens=False)[
-len(expected_string) :
],
expected_string,
)
# with skip_special_tokens
self.assertEqual(self.tokenizer.decode(tokenized_transcription_request.tokens, skip_special_tokens=True), "")
def test_batch_decode(self):
string = "Hello, world!"
string_with_space = "Hello, world !"
@ -919,7 +947,7 @@ class TestMistralCommonTokenizer(unittest.TestCase):
conversation, tokenize=True, return_dict=True, return_tensors="pt"
)
def test_appsly_chat_template_with_truncation(self):
def test_apply_chat_template_with_truncation(self):
conversation = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hi!"},