Open ASR Leaderboard: trends in multilingual ASR 2025
While everyone and their grandmother is spinning up new ASR models, picking the right one for your use case can be more confusing than choosing the next show on your watchlist. The Open ASR Leaderboard has become a clear reference for comparing accuracy and efficiency, and it just added tracks for multilingual and long-form ASR, exactly where benchmarking was most needed.
🌍 Multilingual: broader coverage but usually lower per-language performance
⌛ Long-form: closed systems still lead; open source has clear potential
🧑‍💻 Fine-tuning guides for Parakeet, Voxtral and Whisper to shortcut the adaptation process
For reference, as of Nov 21, 2025, the leaderboard compares 60+ models from 18 organizations across 11 datasets. Sound overwhelming? Let’s break it down.
What the Open ASR Leaderboard measures and why it matters
The leaderboard doesn’t just look at how many words get transcribed correctly. It combines accuracy metrics like WER (word error rate) with efficiency metrics like the inverse of the real-time factor (RTFx). In plain terms: WER tells you how often the model slips up; RTFx tells you how fast it processes audio (higher is better).
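To make those two metrics concrete, here is a minimal sketch using the open-source jiwer package for WER; the audio duration and processing time are made-up numbers for illustration:

```python
import jiwer  # pip install jiwer

# WER: fraction of reference words that were substituted, deleted, or inserted.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
wer = jiwer.wer(reference, hypothesis)  # 2 errors / 9 reference words ≈ 0.22

# RTFx: seconds of audio transcribed per second of compute (higher is better).
audio_seconds = 3600.0      # one hour of audio (hypothetical)
processing_seconds = 120.0  # wall-clock time to transcribe it (hypothetical)
rtfx = audio_seconds / processing_seconds  # 30.0 → 30× faster than real time

print(f"WER:  {wer:.2%}")
print(f"RTFx: {rtfx:.1f}")
```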
Why does that matter today? Many traditional benchmarks focus on short English clips (<30s). But real-world production includes hours-long meetings, podcasts and multilingual scenarios. Those use cases demand evaluating throughput and robustness over long durations.
Main technical trends
Conformer + LLM decoders: leading on accuracy
Models that pair a Conformer encoder with an LLM-based decoder lead the English WER rankings. Examples: NVIDIA Canary-Qwen-2.5B, IBM Granite-Speech-3.3-8B and Microsoft Phi-4-Multimodal-Instruct. Bringing in LLM reasoning helps resolve ambiguities and exploit context to improve accuracy.
Pro-tip: NVIDIA introduced Fast Conformer, a variant roughly 2× faster than the original Conformer, used in Canary and Parakeet.
Speed: CTC and TDT for extreme throughput
If you prioritize speed, CTC and TDT decoders are the go-to. They offer between 10× and 100× more throughput than LLM decoders, with a moderate penalty in WER. Perfect for real-time transcription, batch processing or long-running pipelines.
Practical example: NVIDIA Parakeet CTC 1.1B reaches an RTFx of 2793.75, while Whisper Large v3 has 68.56. The WER gap is small (6.68 vs 6.43), but the impact on costs and infrastructure is large.
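If you want to try a CTC model yourself, NVIDIA's model card shows usage along these lines with the NeMo toolkit (a sketch; the WAV path is hypothetical, and depending on your NeMo version transcribe may return plain strings or hypothesis objects with a .text attribute):

```python
# pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-ctc-1.1b"
)

# Transcribe a local 16 kHz mono WAV file (path is hypothetical).
transcripts = asr_model.transcribe(["meeting_recording.wav"])
print(transcripts[0])
```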
Multilingual vs. specialization: the eternal tradeoff
OpenAI Whisper Large v3 remains a multilingual reference, supporting 99 languages. However, tuned or distilled variants (Distil-Whisper, CrisperWhisper) can outperform the original on English-only tasks. The lesson: fine-tuning for one language boosts performance for that language but reduces coverage.
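For example, a distilled English-only checkpoint drops straight into the standard transformers pipeline (a sketch; the audio file path is hypothetical):

```python
# pip install transformers torch
from transformers import pipeline

# Distil-Whisper: English-only, smaller and faster than Whisper Large v3.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
)

result = asr("interview.wav")  # hypothetical local file
print(result["text"])
```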
Self-supervised systems like Meta MMS and Omnilingual ASR support 1,000+ languages, but they trail language-specific encoders in accuracy. For now the leaderboard’s multilingual track benchmarks only five languages, with plans to expand.
Long-form: closed systems still have the edge
For long transcripts (podcasts, lectures, meetings), closed systems tend to lead. Reasons include domain-specific fine-tuning, custom chunking strategies and production optimizations. Among open models, Whisper Large v3 is top for accuracy, but CTC-based Conformers dominate for throughput.
This points to a clear path for the community: better chunking, reassembly and hybrid pipelines can close the gap.
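As a starting point, the transformers pipeline already exposes the basic chunking knobs for long-form audio (a sketch; the chunk length and batch size are starting values to tune, and the file path is hypothetical):

```python
from transformers import pipeline

# Chunked long-form transcription: the pipeline splits the audio into
# overlapping 30 s windows, transcribes them in batches, and stitches
# the pieces back together.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,   # window size; Whisper was trained on 30 s clips
    batch_size=8,        # transcribe several chunks in parallel
    return_timestamps=True,
)

result = asr("two_hour_podcast.wav")  # hypothetical local file
print(result["text"])
```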
Practical recommendations for choosing or implementing ASR
If you need maximum accuracy in English: look for Conformer + LLM models or fine-tune a checkpoint with domain-specific data.
If you need low latency or high-volume processing: prioritize CTC or TDT for throughput; tune your acceptable WER depending on the use case.
If your product is multilingual: consider Whisper Large v3 or MMS-style systems, but plan hybrid strategies if a critical language needs higher accuracy (see the routing sketch after this list).
For long audio: optimize chunking, train with long-form data and evaluate RTFx in addition to WER.
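A hybrid multilingual setup can be as simple as routing each request by language: a specialized checkpoint where accuracy is critical, a broad multilingual model everywhere else. A minimal sketch, assuming transformers pipelines and hypothetical per-language checkpoints:

```python
from transformers import pipeline

# Language-specific checkpoints where accuracy is critical (the German
# entry is a hypothetical placeholder; substitute your own fine-tunes).
SPECIALIZED = {
    "en": "distil-whisper/distil-large-v3",
    "de": "your-org/whisper-large-v3-de-finetuned",  # hypothetical
}
FALLBACK = "openai/whisper-large-v3"  # broad multilingual coverage

_pipelines = {}

def get_asr(lang: str):
    """Lazily build and cache one pipeline per model route."""
    model = SPECIALIZED.get(lang, FALLBACK)
    if model not in _pipelines:
        _pipelines[model] = pipeline("automatic-speech-recognition", model=model)
    return _pipelines[model]

text = get_asr("de")("kundenanruf.wav")["text"]  # hypothetical file
```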
Want to experiment? Fine-tuning guides for Parakeet, Voxtral and Whisper are available to help you adapt open models to specific tasks.
Community, datasets and the future
The leaderboard is community-driven. Local initiatives (Arabic ASR, Russian ASR) already show how dialectal variation and diglossia complicate modeling. You’re invited to contribute datasets, checkpoints and evaluations: more languages and scenarios will make the benchmarks more useful.
It’s also likely that hybrid architectures and efficiency improvements (e.g., optimized Conformer variants or lightweight decoders with context abilities) will shift the map soon. What will surprise us in six months? Probably something that combines LLM-level accuracy with CTC-like efficiency.
The door is open for innovation in long-form and multilingual ASR. Want your model compared? Open a pull request to the leaderboard repo and upload your results.