Nemotron 3.5 ASR: fine-tune a multilingual, real-time ASR | Keryc
Nemotron 3.5 ASR arrives as a practical answer to real problems you face when building speech-transcribing products: support for many languages without deploying 40 models, live subtitling with low latency without losing accuracy, final text already punctuated and capitalized, and hot language detection. Want a single model that does all that? Here I explain how it works and how to fine-tune it for your language, domain, or accent.
What Nemotron 3.5 ASR solves
If you ever built a transcription service, you know these walls:
Supporting many languages turns your infra into a museum of point integrations.
Realistic streaming often re-processes overlapping windows and spikes latency and cost.
Raw output without punctuation forces you to add another model to restore it.
Many systems require knowing the language beforehand; what if the speaker switches language mid-sentence?
Nemotron 3.5 ASR unifies the solutions: a single 600M-parameter checkpoint that covers 40 locales, real streaming without recompute, punctuated output and language conditioning (you can force it with or leave ).
target_lang=es-ES
target_lang=auto
Architecture and why it's efficient
Two main pieces:
A 24-layer Cache-Aware FastConformer encoder. FastConformer is an efficient evolution of Conformer with linearly scalable attention. The "cache-aware" part stores internal states (self-attention and convolution activations) from previous frames so each audio frame is processed exactly once. Result? No duplicated work, less compute and lower latency without sacrificing accuracy.
An RNNT decoder (Recurrent Neural Network Transducer). RNNT emits text as audio arrives, ideal for live transcription.
The model also uses prompt conditioning for language identification: every clip can carry a target_lang signal that specializes the output. With auto the model detects the language and appends it as a tag at the end of each sentence.
Latency vs. accuracy: the control is in your hands
The balance between how fast text is emitted and how much future audio you "peek" is controlled with att_context_size. Some practical points:
Attention Context
Chunk Size (Latency)
Suggested use
[56,0]
80ms (Ultra-Low)
Ultra-fast voice agents
[56,1]
160ms (Low)
Interactive agents
[56,3]
320ms (Balanced)
Conversation and live subtitling
[56,6]
560ms (Medium)
High accuracy with reasonable latency
[56,13]
1.12s (High)
Maximum accuracy when latency isn't critical
The same checkpoint covers the whole spectrum: you pick the operating point at inference time, no retraining needed.
First steps: clone NeMo and try the model
Clone NeMo and point the streaming script at your audio:
Practical notes: audio should be mono .wav and the manifest a JSON-lines with audio_filepath, duration and text. The model adds a language_tag at the end of each completed sentence (for example "This is a test. "). strip_lang_tags=true removes that tag for readability.
How and when to fine-tune — the technical recipe
The proposed workflow is simple and reproducible:
Prepare tarred data with NeMo/Lhotse for efficient streaming, without unpacking files.
Fine-tune from the base checkpoint using the same Cache-Aware FastConformer-RNNT recipe and conditioning each clip with target_lang.
Evaluate with a held-out set using the same latency setting you'll run in production (for example att_context_size=[56,0]).
Add more data for weak languages/scenarios and repeat.
Export and deploy the fine-tuned checkpoint.
Key points that make a difference:
Each clip must carry the correct target_lang. Prompt conditioning is powerful but intolerant of wrong labels.
Keep the text style of the base model: punctuated and capitalized.
To avoid harming other languages, mix ("replay") a fraction of data from the other languages when fine-tuning.
Always evaluate at the deployment latency; measuring offline or with look-ahead will give you a fantasy of performance.
A concrete example: fine-tuning with ~2,000 hours balanced (Greek + Bulgarian) from public mixes (Granary, Common Voice, FLEURS) and using held-out splits from FLEURS for honest testing.
Representative results (WER in low-latency streaming 80ms):
Language
Base model WER (%)
Fine-tuned WER (%)
Relative improvement
Greek
35
24
32%
Bulgarian
22
15
31%
Additional data (including 2,000 h of parliamentary speech) continued to improve the weaker locales, although gains were not uniform: quantity helps, but domain match matters.
Practical tips for engineering and deployment
On small datasets, a quick pass on a single GPU can be enough to see improvements; scale to multi-GPU for full runs.
Use a "fixed step budget" that works with streaming/iterable data instead of counting epochs.
Protect languages you aren't fine-tuning with replay and check that they don't degrade.
Export the fine-tuned checkpoint to your serving path: the architecture doesn't change, only the weights.
For production, NVIDIA plans a NIM release with gRPC streaming and support across Ampere, Hopper, Blackwell, Lovelace, Turing, Volta and Jetson. That makes it easier to integrate into low-latency pipelines and on-device scenarios.
Notable use cases
Sub-second voice agents: ASR -> LLM -> TTS without ASR becoming the bottleneck.
Live, multilingual subtitles from a single stream.
Global contact-center analytics with one ASR backend instead of many vendors.
On-device transcription on Jetson for privacy or offline scenarios.
Final recommendations and risks to watch
Fine-tuning Nemotron 3.5 ASR can be transformative for low-resource languages or specialized domains. But remember: measure at the latency you'll deploy, label target_lang correctly, and mix data from other languages to avoid unexpected degradation. More data helps, but domain match matters as much as quantity.