Nemotron Speech ASR optimizes real-time voice agents
In voice interactions there has always been an awkward trade-off: speed versus accuracy. Sound familiar? The usual trick is to process overlapping audio windows to keep context, like re-reading the last few pages every time you turn one.
Nemotron Speech ASR, NVIDIA's new open model designed for real-time voice agents, breaks that loop. Based on the FastConformer architecture with 8x downsampling, it introduces cache-aware streaming that processes only the new audio deltas. Instead of redoing the context, it reuses previous representations, achieving up to 3x more efficiency than buffer-based inference systems.
What Nemotron Speech ASR is and why it matters
Nemotron Speech ASR is a family of open models in the Nemotron line, optimized from architecture to inference for low-latency, high-concurrency voice agents.
Technically, it's an RNN-T model built on a FastConformer encoder with depth-wise separable convolutional subsampling that reduces the token rate the attention layers have to process per second (8x downsampling versus the traditional 4x). That cuts VRAM use significantly and raises GPU throughput.
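To make the subsampling factor concrete, here is a small back-of-the-envelope calculation. The 10ms feature hop (100 frames per second) is the common Mel-spectrogram default and is an assumption on my part, not a figure stated by NVIDIA.

```python
# Rough token-rate math for a Conformer-style encoder.
# Assumption: acoustic features use a 10 ms hop (100 frames per second), the common default.
FEATURE_FRAMES_PER_SECOND = 100

def tokens_per_second(subsampling_factor: int) -> float:
    """Frames per second that reach the self-attention layers after subsampling."""
    return FEATURE_FRAMES_PER_SECOND / subsampling_factor

for factor in (4, 8):
    print(f"{factor}x subsampling -> {tokens_per_second(factor):.1f} tokens/s per stream")

# 4x subsampling -> 25.0 tokens/s per stream
# 8x subsampling -> 12.5 tokens/s per stream
# Half as many tokens means less attention compute and a smaller cache per stream.
```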
Key parameters:
Size: ~600M parameters, tuned for high-performance NVIDIA GPUs.
Input: 16 kHz streaming audio.
Output: English text with punctuation and capitalization.
Runtime-configurable latency modes: 80ms, 160ms, 560ms, 1.12s (no retraining required; see the configuration sketch after this list).
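Since the latency modes are selected at inference time, switching them is a configuration change rather than a new checkpoint. A minimal sketch of how this looks with NeMo's cache-aware streaming models follows; the model ID and the context values are placeholders I have not verified against the model card.

```python
# Minimal sketch: switching latency modes at runtime with a NeMo cache-aware streaming model.
# The model name below is a placeholder; check the Hugging Face model card for the exact ID.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/nemotron-speech-asr"  # placeholder ID, not verified
)

# Cache-aware FastConformer encoders expose the attention context (left, right) that
# controls the lookahead, and therefore the latency mode: a larger right context means
# higher latency and better WER. No retraining is involved.
asr_model.encoder.set_default_att_context_size([70, 13])  # example values, illustrative only
```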
The core innovation: instead of re-encoding overlapping windows, Nemotron keeps an internal cache of encoder representations across all self-attention and convolutional layers. When new audio arrives, the model updates that cached state; each audio frame is processed exactly once. Result? Most redundant computation disappears and you avoid the latency drift that plagues legacy systems.
Immediate benefits: lower end-to-end latency, linear memory scaling, higher GPU efficiency, and lower operational costs.
Performance and important metrics
Numbers tell the story: on an NVIDIA H100, Nemotron handles 560 concurrent streams with 320ms chunks, versus 180 streams for the previous baseline (a roughly 3x improvement). On the RTX A5000, reported concurrency gains exceed 5x; on DGX B200, certain setups see up to 2x improvements.
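Concurrency translates directly into cost per stream. A quick calculation shows the effect; the hourly H100 price below is an illustrative assumption, not a quoted figure.

```python
# Illustrative cost-per-stream math using the concurrency numbers above.
# The hourly H100 price is an assumed placeholder; plug in your own cloud pricing.
H100_PRICE_PER_HOUR = 4.00  # USD, assumption for illustration

for label, concurrent_streams in (("baseline (buffered)", 180), ("Nemotron (cache-aware)", 560)):
    cost = H100_PRICE_PER_HOUR / concurrent_streams
    print(f"{label}: ${cost:.4f} per stream-hour")

# baseline (buffered): $0.0222 per stream-hour
# Nemotron (cache-aware): $0.0071 per stream-hour
```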
Beyond throughput, Nemotron is stable: latency stays nearly flat as the number of streams increases, without the drift seen in buffer-based inference. In Modal tests with 127 simultaneous WebSocket clients over 3 minutes, median latency was 182ms with no appreciable drift.
Accuracy versus latency (practical example): increasing chunk latency from 160ms to 560ms improves WER from 7.84% to 7.22%. In other words, you can control the trade-off at runtime without retraining.
Time-to-final: Nemotron reports medians of 24ms. For reference, local alternatives on an L40 GPU hover around 90ms, and public APIs sit at 200ms or more.
In a full pipeline (Nemotron ASR + Nemotron 3 Nano 30B + Magpie TTS + Pipecat) the local voice-to-voice loop dropped below 900ms, enough for natural conversational turn-taking and interruptions.
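To see how a sub-900ms loop might break down, here is a hypothetical budget. Only the ~24ms ASR time-to-final comes from the figures above; the other components are illustrative placeholders you should replace with your own profiling.

```python
# Hypothetical voice-to-voice latency budget in milliseconds. Only the ASR figure
# comes from the article; the rest are illustrative placeholders, not measurements.
budget_ms = {
    "ASR time-to-final (Nemotron)": 24,
    "LLM first token + short reply": 500,    # assumption
    "TTS first audio chunk": 200,            # assumption
    "network / orchestration overhead": 100, # assumption
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:36s} {ms:4d} ms")
print(f"{'total':36s} {total:4d} ms  (target: < 900 ms)")
```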
Cache-aware streaming engineering (what happens under the hood)
The idea is simple and powerful: keep encoder states cached and update only what's new. Concretely:
Intermediate activations from self-attention and convolutional layers are stored.
When new frames arrive, the encoder concatenates or mixes the new representation with the cache without re-evaluating the old.
The 8x subsampling reduces the number of tokens flowing through attention, cutting memory and compute.
There's also prediction chunking and lookahead logic to keep memory bounded and latency predictable. The model processes each frame once and avoids recalculating overlapping context, which mitigates the memory buildup that causes drift in traditional systems.
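Here is a minimal, framework-agnostic sketch of the idea, not NVIDIA's actual implementation: the encoder step sees only the new frames plus the cached state, and returns updated caches so old frames are never re-encoded.

```python
# Conceptual sketch of cache-aware streaming; not the actual NeMo/Nemotron code.
# encoder_step() stands in for one forward pass over only the NEW frames, reading and
# updating per-layer caches for self-attention context and convolution left context.

def transcribe_stream(audio_chunks, encoder_step, decoder_step, init_cache):
    cache = init_cache()           # empty per-layer attention + convolution caches
    decoder_state = None
    transcript = []

    for chunk in audio_chunks:     # e.g. 80-560 ms of new audio per iteration
        # Each frame enters the encoder exactly once; the cache supplies the left
        # context, so no overlapping window is ever re-encoded.
        encoded, cache = encoder_step(chunk, cache)
        tokens, decoder_state = decoder_step(encoded, decoder_state)
        transcript.extend(tokens)
        yield "".join(transcript)  # partial hypothesis after each chunk
```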
How to try it today (practical steps)
If you want to validate it and run a proof of concept:
Download the Nemotron Speech ASR checkpoint from Hugging Face (huggingface.co link in the source).
Use NVIDIA NeMo to enable cache-aware streaming in your inference pipeline (a minimal loading sketch follows this list).
Deploy the endpoint on Modal for scaled tests and WebSocket streaming.
Integrate with orchestration frameworks like Pipecat and TTS like Magpie if you need a voice-to-voice loop.
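As a starting point for step two, here is a sketch (not a full streaming pipeline): load the checkpoint in NeMo and run a quick offline sanity check before wiring up streaming. The model ID and the example-script path are assumptions to verify against the model card and your NeMo version.

```python
# Quick sanity check with NeMo before building the streaming pipeline.
# The model ID is a placeholder; use the one from the Hugging Face model card.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/nemotron-speech-asr"  # placeholder, verify on Hugging Face
)

# Offline transcription of a local 16 kHz mono WAV file (hypothetical path).
print(asr_model.transcribe(["sample_16khz.wav"]))

# For true cache-aware streaming, NeMo ships an example script (path may vary by version):
#   examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
# It feeds audio chunk by chunk and carries the encoder cache between steps.
```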
These steps let you measure latency at scale and compare cost per stream versus buffer-based solutions or proprietary APIs.
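To reproduce the concurrency test pattern described earlier, a simple WebSocket client can stream audio chunks and time the responses. The endpoint URL and wire format below are assumptions, since your Modal deployment defines its own protocol.

```python
# Hypothetical WebSocket load-test client; the URL and message format are placeholders
# that depend on how your Modal endpoint is implemented.
import asyncio
import time
import websockets

ENDPOINT = "wss://your-modal-app.modal.run/asr"  # placeholder URL
CHUNK_MS = 320

async def stream_one_client(pcm_chunks: list[bytes]) -> float:
    """Send 16 kHz 16-bit mono PCM chunks and return the median per-chunk latency."""
    latencies = []
    async with websockets.connect(ENDPOINT) as ws:
        for chunk in pcm_chunks:                  # CHUNK_MS of audio per message
            sent = time.perf_counter()
            await ws.send(chunk)
            await ws.recv()                       # partial transcript for this chunk
            latencies.append(time.perf_counter() - sent)
            await asyncio.sleep(CHUNK_MS / 1000)  # pace like real-time audio
    latencies.sort()
    return latencies[len(latencies) // 2]

# Launch N clients concurrently with asyncio.gather() to approximate the 127-client test.
```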
Who this is for and what limitations you should consider
Nemotron targets developers and companies building real-time voice agents: meeting assistants, customer support, multimodal interfaces, and collaborative apps.
Limitations and considerations:
Streaming output is currently oriented to English; you'll need to evaluate accent and noise coverage on your data.
Peak performance is achieved on NVIDIA hardware; optimizations can be hardware-dependent.
While it reduces cost per stream, significant GPU infrastructure is still required for large loads.
Final thoughts
Cache-aware streaming changes how we think about real-time ASR: you no longer have to choose between speed and scale. Nemotron shows that with an architecture designed for streaming you can get fast, stable, and reproducible transcriptions at high concurrency.
If you're building a serious voice agent, the question isn't whether you'll use optimized ASR, but when you'll migrate so your voice conversations stop feeling artificial.