Google improves Gemini with native audio for voice agents

Dec 12, 2025 · Keryc Díaz · 4 minutes

This week Google updated its Gemini audio models to make voice interactions more natural and powerful. What does that mean for you — as a user, a developer, or a business thinking about voice assistants? Less robotic answers, more useful conversations, and new possibilities for real-time translation.

What the update brings

Google released an improved version of Gemini 2.5 Flash Native Audio aimed at live voice agents. It’s not just about generating more expressive speech (Google already advanced that with the Gemini 2.5 Pro and Flash TTS models); it’s about improving how the AI understands complex workflows, follows instructions, and keeps dialogue coherent.

The update is already available in Google AI Studio and Vertex AI, and it’s rolling out to Gemini Live and Search Live. In practice this lets you, for example, brainstorm live with Gemini, get real-time help from Search Live, or build enterprise-grade customer service agents.

Key improvements

Google highlights three areas where the model levels up:

More accurate function calls: the model better identifies when to fetch external information and folds it back into the audio response without breaking the conversation. On the ComplexFuncBench Audio benchmark, which measures multi-step function calling under constraints, Gemini 2.5 Flash Native Audio scores 71.5%.
Better instruction following: fidelity to developer prompts increases, with an adherence rate of 90% (up from 84%), which translates into more complete and reliable outputs.
Smoother conversations: multi-turn quality improves; the model retrieves context from previous turns more effectively, making the interaction sound more coherent and natural.

Practical result: fewer interruptions, fewer out-of-context replies, and greater consistency during long sessions.
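To make the function-calling item concrete, here is a minimal sketch of the application side of that loop: the model asks for a function by name with arguments, the app runs it, and the result goes back to the model to be spoken. This is not the Gemini API; the tool name `get_order_status`, the `handle_function_call` helper, and the dictionary shapes are all assumptions chosen for illustration.

```python
from typing import Any, Callable, Dict


def get_order_status(order_id: str) -> Dict[str, Any]:
    """Hypothetical backend lookup; a real agent would query a database or API."""
    return {"order_id": order_id, "status": "shipped"}


# Hypothetical registry mapping function names the model may request to callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def handle_function_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one requested call and wrap the result (or an error) for the model."""
    name = call.get("name")
    args = call.get("args", {})
    tool = TOOLS.get(name)
    if tool is None:
        # Report unknown functions instead of crashing the live session.
        return {"name": name, "response": {"error": f"unknown function: {name}"}}
    try:
        return {"name": name, "response": tool(**args)}
    except TypeError as exc:
        # Missing or unexpected arguments are returned as an error payload.
        return {"name": name, "response": {"error": str(exc)}}


# Assumed shape of a call emitted by the model mid-conversation.
reply = handle_function_call({"name": "get_order_status", "args": {"order_id": "A123"}})
print(reply)
```

Returning errors as data rather than raising keeps the conversation going: the model can apologize or ask for the missing detail instead of the session ending abruptly.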

What customers are saying

Several companies are already using these capabilities for real results:

Shopify: their VP of Product says users sometimes forget they’re talking to AI, and the new Live APIs help increase sales.
UWM (United Wholesale Mortgage): after integrating Gemini 2.5 Flash Native Audio, they say they were able to generate thousands of loans thanks to better interactions.
Newo.ai: its AI receptionists, built on Vertex AI, can identify speakers in noisy environments, switch languages mid-conversation, and keep emotional expressiveness.

These testimonials aren’t just hype; they show concrete commercial cases from phone support to transaction processing.

Live voice translation: hearing the world in your language

One of the most striking features is live voice translation for headphones. It works in two modes:

Continuous listening: translates multiple languages into a target language as a stream, so you can put on headphones and hear what’s being said around you in your language.
Bidirectional conversation: translates between two languages in real time, automatically switching the output language depending on who’s speaking. Example: you speak English and your interlocutor speaks Hindi; you hear the translation in English and your phone transmits Hindi when you finish speaking.
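The bidirectional mode boils down to a routing decision per utterance: which language to translate into and where to play the result. The sketch below shows one way that decision could look, assuming the speech has already been transcribed and its language detected. The `Route` type, the `route_utterance` function, and the output names are hypothetical and are not taken from Google's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    target_language: str  # language to translate the utterance into
    output: str           # "headphones" (for you) or "phone_speaker" (for them)


def route_utterance(detected: str, my_language: str, their_language: str) -> Route:
    """Pick the translation target and output device from the detected language."""
    if detected == my_language:
        # You spoke: the phone plays the translation for the other person.
        return Route(target_language=their_language, output="phone_speaker")
    if detected == their_language:
        # They spoke: you hear the translation privately.
        return Route(target_language=my_language, output="headphones")
    raise ValueError(f"unexpected language for this conversation: {detected}")


# The English/Hindi example from the text, using language codes as labels.
print(route_utterance("hi", my_language="en", their_language="hi"))
print(route_utterance("en", my_language="en", their_language="hi"))
```

Continuous listening would be the simpler case under the same assumptions: every detected language maps to one fixed target, and the output is always the headphones.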

Practical features:

Coverage: more than 70 languages and 2,000 language pairs.
Style transfer: preserves intonation, rhythm, and timbre so the translation sounds natural.
Multilingual input: detects and understands multiple languages in the same session.
Auto-detection: you don’t need to select the language; the system detects it and starts translating.
Noise robustness: filters ambient noise for conversations outdoors or in loud places.

The beta is available starting today in the Google Translate app for Android in the United States, Mexico, and India; iOS and more regions are coming soon. Google plans to bring it to the Gemini API in 2026 based on feedback.

For developers and businesses

If you want to build voice agents, Gemini 2.5 Flash Native Audio is generally available in Vertex AI and in preview on the Gemini API. The Gemini 2.5 Flash and 2.5 Pro TTS models are also available from the API in Google AI Studio.

Useful resources mentioned by Google:

Developer documentation and guides for speech generation.
Prompting guide and the Gemini API Cookbook to get started with examples.

Why this matters now

Because voice stops being just a channel and becomes an interface that understands context, actions, and human nuance. For businesses it means better call automation, less user frustration, and new forms of multilingual communication. For people, it means being able to talk with tools that understand complex instructions and translate in real time without sounding artificial.

Interested in trying it or integrating it into a product? Start with Vertex AI or the Gemini API preview and play with the examples in Google AI Studio; the best way to understand it is by listening.

Original source

https://blog.google/products/gemini/gemini-audio-model-updates
