Gemma 4 and Cerebras power real-time voice AI

Hugging Face and Cerebras present an open real-time voice stack that makes conversations with AI feel natural. The goal is a spoken answer that arrives when you expect it rather than several seconds later, and the key is cutting LLM latency with fast, stable inference.

What they announced

Hugging Face built a real-time speech-to-speech demo that runs interactive voice chat over WebSocket. The pipeline is modular and fully open: you can inspect, swap, and adapt each component for assistants, robots, or research projects.

The full sequence is:

Voice input
Speech recognition with Nvidia Parakeet
Inference with Gemma 4 VLM (Google DeepMind, 31B) running on Cerebras hardware
Voice synthesis with Alibaba Qwen3TTS
Spoken reply
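
The modular sequence above can be sketched as a chain of swappable stages. This is a minimal illustration, not the demo's actual code: `VoicePipeline`, `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical names, and the stubs only pass text through where a real stage would call Parakeet, Gemma 4 on Cerebras, or Qwen3TTS. The sketch also records how long each stage takes, since per-stage latency is what determines whether the reply feels on time.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A stage is a (name, function) pair. Any callable with a compatible
# input and output can be swapped in, which is the point of a modular design.
Stage = Tuple[str, Callable]


@dataclass
class VoicePipeline:
    """Hypothetical pipeline: runs stages in order and times each one."""
    stages: List[Stage]
    timings: Dict[str, float] = field(default_factory=dict)

    def run(self, audio_in: bytes) -> bytes:
        data = audio_in
        for name, fn in self.stages:
            start = time.perf_counter()
            data = fn(data)
            # Seconds spent in this stage for the most recent run.
            self.timings[name] = time.perf_counter() - start
        return data


# Stub stages (assumptions for illustration only; no real model is called).
def fake_asr(audio: bytes) -> str:
    # Stand-in for speech recognition: treat the bytes as UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    # Stand-in for LLM inference.
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    # Stand-in for voice synthesis: encode the text back to bytes.
    return text.encode("utf-8")


pipeline = VoicePipeline(
    stages=[("asr", fake_asr), ("llm", fake_llm), ("tts", fake_tts)]
)
reply = pipeline.run(b"hello")
```

Swapping a component means replacing one entry in `stages`; the rest of the chain is unchanged as long as the new callable accepts and returns the same types. After a run, `pipeline.timings` shows which stage dominates the total response time.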

Hugging Face already uses this stack in Reachy Mini robots (over 9,000 units), where speed is not a luxury: it is what makes the interaction feel alive.
