Gemma 4 and Cerebras power real-time voice AI | Keryc
Hugging Face and Cerebras present an open real-time voice stack that makes conversations with AI feel natural. Want the answer when you expect it, not several seconds later? The trick is cutting LLM latency with fast, stable inference so responses arrive on time.
What they announced
Hugging Face put together a real-time speech-to-speech demo that uses WebSocket for interactive voice chat. The pipeline is modular and fully open: you can inspect, swap, and adapt each component for assistants, robots, or research projects.
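To give a rough sense of why a persistent WebSocket connection suits streamed audio, the sketch below compares the framing bytes added to one audio chunk with the header bytes of a standalone HTTP request. The frame-header sizes follow the WebSocket framing rules in RFC 6455. The chunk size (20 ms of 16 kHz, 16-bit mono audio), the `/audio` path, and the HTTP headers are assumptions made for illustration and are not taken from the demo. It counts bytes only and is not a latency measurement.

```python
def ws_frame_header_size(payload_len: int, masked: bool) -> int:
    """Header bytes for a single WebSocket frame (RFC 6455 framing)."""
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    size = 2                      # FIN/opcode byte + mask bit/7-bit length byte
    if payload_len > 65535:
        size += 8                 # 64-bit extended payload length
    elif payload_len > 125:
        size += 2                 # 16-bit extended payload length
    if masked:
        size += 4                 # masking key, required for client-to-server frames
    return size


# Assumption: one chunk is 20 ms of 16 kHz, 16-bit mono PCM audio.
CHUNK_BYTES = 16000 * 2 * 20 // 1000

# Hypothetical minimal HTTP/1.1 request head for posting the same chunk.
HTTP_REQUEST_HEAD = (
    "POST /audio HTTP/1.1\r\n"
    "Host: example.invalid\r\n"
    "Content-Type: application/octet-stream\r\n"
    f"Content-Length: {CHUNK_BYTES}\r\n"
    "\r\n"
).encode("ascii")

ws_overhead = ws_frame_header_size(CHUNK_BYTES, masked=True)
http_overhead = len(HTTP_REQUEST_HEAD)

print(f"chunk: {CHUNK_BYTES} bytes")
print(f"WebSocket frame header: {ws_overhead} bytes")
print(f"HTTP request head: {http_overhead} bytes")
```

Real HTTP requests usually carry more headers than this minimal one, and HTTP keep-alive or HTTP/2 reduce some of the cost, so treat the comparison as indicative only.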
The full sequence is:
Voice input
Speech recognition with Nvidia Parakeet
Inference with Gemma 4 VLM (Google DeepMind, 31B) running on Cerebras hardware
Voice synthesis with Alibaba Qwen3TTS
Spoken reply
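The sequence above can be sketched as three interchangeable stages. This is a minimal illustration, not the code from the huggingface/speech-to-speech repository: `VoicePipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical names, and the stubs stand in for Parakeet, Gemma 4, and Qwen3TTS so the example runs without any model.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Each stage is a plain callable, so any one of them can be replaced on its own.
Transcriber = Callable[[bytes], str]   # audio in -> user text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


@dataclass(frozen=True)
class VoicePipeline:
    stt: Transcriber
    llm: Responder
    tts: Synthesizer

    def respond(self, audio_in: bytes) -> bytes:
        """Run one conversational turn: recognize, generate, synthesize."""
        text = self.stt(audio_in)
        reply = self.llm(text)
        return self.tts(reply)


# Hypothetical stubs: "audio" is just UTF-8 text here.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"hello")

# Swapping one stage leaves the other two untouched.
shouting = replace(pipeline, llm=str.upper)
shouted_out = shouting.respond(b"hello")
```

Keeping the stages behind simple call signatures is one way to get the swap-a-component property the article describes. The real project may structure this differently.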
Hugging Face already uses this stack in Reachy Mini robots (over 9,000 units), where speed is not a luxury: it’s what makes the interaction feel alive.
Technical architecture and why it matters
The stack combines three strengths of the open ecosystem: Cerebras inference speed, the capability of the Gemma 4 31B model, and Qwen3TTS voice quality. Each layer is interchangeable, so you can experiment (swap the STT, try a different TTS) without rebuilding everything.
Key technical points:
Communication runs over WebSocket, which keeps a single connection open and avoids the per-request overhead of traditional HTTP.
The critical step is LLM inference (here Gemma 4 VLM). If that takes several seconds, the conversation falls apart.
Cerebras targets that bottleneck: it reduces both latency and its variance (jitter), with the biggest gains in tail latency (P95).
Many stacks achieve a good median latency, but spikes at P95 and P99 break the sense of a natural conversation. What Cerebras brings is predictable latency.
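The difference between median and tail latency can be made concrete with a small percentile calculation. The sketch below uses the nearest-rank definition of a percentile. Both sample sets are invented for illustration and are not measurements of Cerebras or any other system.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response latencies in milliseconds, 100 turns each.
spiky_ms = [300] * 94 + [900] * 4 + [3000] * 2   # fast median, occasional long stalls
steady_ms = [450] * 100                          # slower median, no spikes

for name, samples in (("spiky", spiky_ms), ("steady", steady_ms)):
    p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
    print(f"{name}: P50={p50} ms  P95={p95} ms  P99={p99} ms")
```

In this made-up data the spiky system wins on the median (300 ms against 450 ms) but stalls for 3 seconds at P99, which is the kind of pause a listener notices.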
Latency, stability and experience
Here’s a detail that doesn’t always get discussed: for fluid conversations you need stability, not just a good average. Would you rather have an occasional multi-second pause, or a slightly slower but consistent reply? Most people choose predictability.
Improving both average latency and the high percentiles keeps the conversational flow intact. And when you add tool calls or multimodal steps (for example, using vision or external engines), each round adds latency. Keeping LLM inference predictable makes those integrations feel more immediate.
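A back-of-the-envelope model shows how tool calls multiply the cost of slow inference. It assumes the stages run strictly one after another and that every tool round needs one extra LLM pass to consume the tool result. It ignores streaming and overlap, which real systems use to hide part of this time. The function name and all numbers are hypothetical.

```python
def turn_latency_ms(stt_ms: int, llm_ms: int, tts_ms: int,
                    tool_ms: int = 0, tool_rounds: int = 0) -> int:
    """Total time for one spoken turn under a simple sequential model."""
    if tool_rounds < 0:
        raise ValueError("tool_rounds must be non-negative")
    llm_passes = tool_rounds + 1          # one pass per tool round, plus the final answer
    return stt_ms + llm_passes * llm_ms + tool_rounds * tool_ms + tts_ms


# Hypothetical stage timings in milliseconds.
STT_MS, TTS_MS, TOOL_MS = 150, 200, 100

fast_plain = turn_latency_ms(STT_MS, 200, TTS_MS)
slow_plain = turn_latency_ms(STT_MS, 1500, TTS_MS)
fast_tools = turn_latency_ms(STT_MS, 200, TTS_MS, TOOL_MS, tool_rounds=2)
slow_tools = turn_latency_ms(STT_MS, 1500, TTS_MS, TOOL_MS, tool_rounds=2)

print(f"no tools:      fast={fast_plain} ms  slow={slow_plain} ms")
print(f"2 tool rounds: fast={fast_tools} ms  slow={slow_tools} ms")
```

Under these assumptions the gap between the fast and slow LLM grows from 1,300 ms without tools to 3,900 ms with two tool rounds, because the inference time is paid on every pass.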
Use cases and practical applicability
Real-time dialog robots (example: Reachy Mini)
Assistants embedded in IoT products and social robots
Immersive VR/AR experiences where responsiveness is critical
Voice support and accessibility services where every second counts
If you work on product, this isn’t just about costs: it’s about delivering an experience users perceive as reliable and human.
For developers and researchers
The stack is modular and open source. You can:
Replace the STT with another engine if you need specific dialect support
Try variants of Gemma 4 or smaller models to balance cost/latency
Tune the TTS for voice nuances or languages
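One way to reason about the cost/latency trade-off when trying different models is to pick the best-scoring candidate whose tail latency fits a budget. The sketch below is an assumption about how such a choice could be made, not something the project provides: the candidate names, P95 figures, and quality scores are all placeholders you would replace with your own measurements.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str        # placeholder label, not a real model identifier
    p95_ms: int      # measured P95 inference latency (hypothetical values below)
    quality: float   # any score where higher is better (hypothetical values below)


def pick_model(candidates: Sequence[Candidate], budget_ms: int) -> Optional[Candidate]:
    """Return the highest-quality candidate whose P95 latency fits the budget, if any."""
    fitting = [c for c in candidates if c.p95_ms <= budget_ms]
    return max(fitting, key=lambda c: c.quality, default=None)


candidates = [
    Candidate("large-model", p95_ms=900, quality=0.9),
    Candidate("medium-model", p95_ms=400, quality=0.8),
    Candidate("small-model", p95_ms=150, quality=0.6),
]

choice = pick_model(candidates, budget_ms=500)
print(choice.name if choice else "no candidate fits the budget")
```

Filtering on P95 rather than the median matches the article's point that consistency matters more than a good average.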
Demo: Hugging Face Space
Repository: huggingface/speech-to-speech
They invite you to explore the demo, review the code, and contribute. Want to experiment with real-time voice? This is a good open lab: it combines infrastructure engineering (latency, jitter, WebSocket) with model work (size, accuracy, multimodal capability).
The idea is clear: future conversational AI will be both open and efficient. It’s not just about releasing models; it’s about integrating them into systems that respond when people need them.