Voxtral Transcribe 2 from Mistral: transcribe in real time | Keryc
Mistral just launched Voxtral Transcribe 2, a family of speech-to-text models built for two things we all want: accuracy and speed. Want live captions with no lag, or to transcribe hours of meetings while identifying who spoke? This is aimed straight at that problem.
What Mistral offers
The offering comes in two clear flavors:
Voxtral Mini Transcribe V2: a batch transcription model with diarization, timestamps, and support for 13 languages.
Voxtral Realtime: designed for live applications, with configurable latency down to under 200 ms. Its weights are open under the Apache 2.0 license.
Mistral also added an audio playground in Mistral Studio so you can try transcription instantly, with diarization and timestamps.
Key highlights
Ultra-low latency: Realtime can run below 200 ms, ideal for voice agents and smooth conversational experiences.
Quality and efficiency: Mini Transcribe V2 reaches around 4% word error rate on the FLEURS benchmark and is offered at $0.003 USD per minute, a combination that is hard to beat today.
Open weights: Realtime is released under Apache 2.0 and can be deployed at the edge for privacy-sensitive applications.
Multilingual support: 13 languages handled natively (Spanish, English, Chinese, Hindi, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch).
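To make the word error rate figure concrete: WER is usually computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration in Python, not Mistral's evaluation code. The function name and sample sentences are made up for the example, and real benchmark runs typically apply more text normalization than the simple lowercasing used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (not taken from any benchmark).
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

Here two of the nine reference words are wrong, so the WER is about 22%. By the same measure, a 4% WER means roughly one error for every 25 words.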
Voxtral Realtime
This variant is not an offline model adapted for streaming; it uses a streaming architecture that transcribes audio as it arrives. That allows very low delay settings and a near-instant feel.
Performance: at 2.4 s of delay it matches the batch model's accuracy. At 480 ms the gap in error rate stays within 1–2%, which is enough for voice assistants that feel fast and reliable.
Size and deployment: 4 billion parameters, small enough to run efficiently on edge devices and keep sensitive data local.
License: open weights under Apache 2.0, available on Hugging Face.
Voxtral Mini Transcribe V2
Built to transcribe large volumes with enterprise features and quality:
Speaker diarization with start and end times per word so you know who said what and when. When speech overlaps, the model usually transcribes only one of the speakers.
Context biasing: you can pass up to 100 words or phrases to improve spelling of proper names and technical terms. Optimized for English; support in other languages is experimental.
Word-level timestamps: useful for captions, audio search, and syncing content.
Noise robustness: designed for tough environments like factories or call centers.
Long audio: processes up to 3 hours in a single request.
Price and comparison: around 4% WER at $0.003 USD per minute. According to Mistral, it is more accurate than options like GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal and Deepgram Nova, and it processes audio roughly 3x faster than ElevenLabs Scribe v2 at comparable quality and a fraction of the cost.
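Diarized, word-level output is usually easier to read once consecutive words from the same speaker are merged into turns. The following is a minimal sketch of that post-processing step, assuming each word arrives with a speaker label, start and end times in seconds, and its text. The `Word` and `Turn` types and the `max_gap` parameter are hypothetical and do not mirror the actual API response schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word], max_gap: float = 1.0) -> list[Turn]:
    """Merge consecutive words by the same speaker into a single turn.

    A new turn starts when the speaker changes or when the silence since
    the previous word exceeds max_gap seconds.
    """
    turns: list[Turn] = []
    for word in words:
        last = turns[-1] if turns else None
        if last and last.speaker == word.speaker and word.start - last.end <= max_gap:
            last.end = word.end
            last.text += " " + word.text
        else:
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


# Made-up sample data for illustration.
sample_words = [
    Word("A", 0.0, 0.4, "Hello"),
    Word("A", 0.5, 0.9, "everyone"),
    Word("B", 1.2, 1.5, "Hi"),
    Word("A", 1.8, 2.1, "Let's"),
    Word("A", 2.2, 2.6, "start"),
]
for turn in group_into_turns(sample_words):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

The `max_gap` threshold is a design choice: a smaller value splits long monologues at pauses, which tends to suit captions, while a larger one produces fewer, longer turns that are better for summaries.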
Audio playground in Mistral Studio
You can try Voxtral Transcribe 2 directly in Mistral Studio. Key features:
Upload up to 10 files of up to 1 GB each, in .mp3, .wav, .m4a, .flac, or .ogg format.
Toggle diarization, choose timestamp granularity, and add terms for context biasing.
It’s a practical way to evaluate real results before integrating the API.
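If you are preparing a batch of recordings for the playground, the limits above are easy to check locally first. This is a small sketch, not part of any Mistral tooling: the function name is hypothetical, and it assumes 1 GB means 10^9 bytes, which the article does not specify.

```python
from pathlib import PurePath

MAX_FILES = 10
MAX_BYTES = 1_000_000_000  # assumption: 1 GB = 10**9 bytes
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def check_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return a list of problems for (filename, size_in_bytes) pairs.

    An empty list means the batch fits the limits described in the article.
    """
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files exceeds the limit of {MAX_FILES}")
    for name, size in files:
        extension = PurePath(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {extension or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_BYTES}")
    return problems


# Made-up file names and sizes for illustration.
batch = [("standup.mp3", 48_000_000), ("interview.WAV", 1_500_000_000), ("notes.txt", 2_000)]
for problem in check_batch(batch):
    print(problem)
```

Taking names and sizes as plain values keeps the check independent of the filesystem; in practice you would fill them in from `os.path.getsize` or similar.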
Real-world use cases
Meetings and interviews: multilingual transcripts with speaker attribution for analysis and summaries.
Voice assistants: sub-200 ms transcription feeding an LLM and a TTS engine for natural conversational experiences.
Contact centers: live transcription to suggest responses, analyze sentiment, and update CRMs while the call continues.
Media and broadcasts: low-latency live captions in multiple languages.
Compliance and auditing: diarization and timestamps provide traceability for regulatory needs.
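For the captioning use case, word-level timestamps map naturally onto subtitle formats such as SRT. The sketch below groups words into fixed-size cues and formats them as SRT text. It assumes words are available as `(start, end, text)` tuples with times in seconds, which is an illustrative shape rather than the API's actual output, and the `max_words` cue size is an arbitrary choice.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[tuple[float, float, str]], max_words: int = 7) -> str:
    """Build SRT text with one cue per group of up to max_words words."""
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        start, end = chunk[0][0], chunk[-1][1]
        text = " ".join(word for _, _, word in chunk)
        cues.append(f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


# Made-up timings for illustration.
caption_words = [
    (0.0, 0.3, "Live"),
    (0.35, 0.8, "captions"),
    (0.85, 1.2, "need"),
    (1.25, 1.6, "word"),
    (1.65, 2.0, "timestamps"),
]
print(words_to_srt(caption_words, max_words=3))
```

A production captioner would more likely break cues on pauses, punctuation, or line length rather than a fixed word count, but the timestamp handling stays the same.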
Privacy, deployment and pricing
Realtime: open weights under Apache 2.0, available on Hugging Face. Enables edge deployments to keep data local.
Compliance: Mistral states support for GDPR and HIPAA-compliant deployments via on-premise or private cloud setups.
API prices announced:
Voxtral Mini Transcribe V2: $0.003 USD per minute.
Voxtral Realtime: $0.006 USD per minute.
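As a rough worked example of what these prices mean in practice, the snippet below estimates the cost of a recording from its length. It is only a sketch: the dictionary keys are illustrative labels rather than official model IDs, and it assumes usage is billed pro rata by the second, which the article does not confirm.

```python
from decimal import Decimal

# Per-minute API prices quoted in the article (USD).
# The keys are illustrative labels, not official model identifiers.
PRICE_PER_MINUTE = {
    "mini-transcribe-v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(model: str, audio_seconds: int) -> Decimal:
    """Estimated cost in USD, assuming pro-rated per-second billing."""
    minutes = Decimal(audio_seconds) / 60
    return (minutes * PRICE_PER_MINUTE[model]).quantize(Decimal("0.0001"))


three_hours = 3 * 60 * 60  # the longest single request mentioned for Mini Transcribe V2
for model in PRICE_PER_MINUTE:
    print(f"{model}: ${estimate_cost(model, three_hours)}")
```

Under those assumptions, a 3-hour recording comes to about $0.54 with Mini Transcribe V2 and $1.08 with Realtime, and 1,000 hours of audio would cost about $180 and $360 respectively.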
Who this is for and why it matters
If you work with audio at scale, build voice assistants, or run contact centers, this shifts the balance between cost, speed, and quality. Need to transcribe large volumes with decent diarization? Mini V2 promises efficiency. Does your product demand minimal latency? Realtime opens the door to more natural, private conversational experiences.
Mistral is betting on putting open weights in the community’s hands without sacrificing commercial performance. The result? More options to deploy voice solutions with better control over privacy and costs.