Voxtral
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations.
Voxtral is pretrained on a large-scale corpus of audio and text documents, and subsequently instruction-tuned on real and synthetic data. It can respond directly to audio (or text) and answer questions about audio files. With a 32K-token context window, Voxtral can process audio files up to 40 minutes long.
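The two figures above imply a simple budget: a 32K-token context divided over 40 minutes of audio bounds the audio token rate. The sketch below is a back-of-the-envelope check using only the numbers stated in the text; the actual per-second audio token rate and any reserved prompt/response budget are not given in this excerpt.

```python
# Illustrative budget check: how many context tokens per second of audio
# does "32K tokens covers 40 minutes" imply? (Upper bound only — the real
# tokenizer rate and reserved text budget are not stated here.)
CONTEXT_TOKENS = 32_000
MAX_AUDIO_MINUTES = 40

audio_seconds = MAX_AUDIO_MINUTES * 60                 # 2400 seconds
max_tokens_per_second = CONTEXT_TOKENS / audio_seconds # ≈ 13.3 tokens/s

print(f"Audio budget: {audio_seconds} s")
print(f"Implied max audio token rate: {max_tokens_per_second:.1f} tokens/s")
```

Any tokens spent on the text prompt, conversation history, or generated answer come out of the same 32K budget, so the effective audio rate must sit somewhat below this bound.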
Compared with similarly sized models in the same evaluation setting, we find that Voxtral delivers strong audio reasoning capabilities without sacrificing text-only performance. It achieves state-of-the-art results on speech transcription and translation, outperforming other open-weight and closed models. On speech question-answering (QA) and summarization, it performs comparably to closed models in a similar price class, such as GPT-4o mini [Hurst et al., 2024] and Gemini 2.5 Flash [Comanici et al., 2025].
While evaluating Voxtral and other models, we found that the existing ecosystem of speech evaluations lacks breadth and standardization: most prior work focuses on transcription and translation quality, with less attention to broader understanding tasks. In Section 3.4, we present evaluations that measure a wider range of speech comprehension and reasoning tasks.