Benchmark: ASR in code-switched speech for voice agents | Keryc
More than half the world speaks more than one language. What happens when your customers mix languages in the same sentence? It happens all the time in contact centers and help desks, and mistakes in the first stage, automatic speech recognition (ASR), contaminate the rest of the pipeline. ServiceNow and team built a benchmark to answer a direct question: how do frontier ASR systems behave with code-switching in enterprise scenarios?
What they did
They focused on the critical stage: automatic speech recognition (ASR). They started from real HR and IT support interactions and generated a synthetic corpus of utterances that mix a matrix language with English fragments. Why synthetic? Because generating the audio with TTS let them control language mixing and audio quality, which ensures coverage and reproducibility.
The flow looked like this:
They took parallel pairs (English + another language) and filtered candidates with good potential for code-switching.
They kept sentences between 12 and 40 words, and required at least three “switchable” words (nouns, verbs, adjectives) to keep results natural.
They prompted an LLM (OpenAI GPT-5) with a persona to generate the code-switched text, ran a verbalization pass to put it in spoken form, and synthesized the audio with ElevenLabs Multilingual V2.
A native linguist of the matrix language reviewed each recording; failures were excluded or regenerated.
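The candidate filter in the second step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(word, pos)` input layout and the use of coarse tags such as `NOUN`, `VERB` and `ADJ` are assumptions, and the tags would come from whatever POS tagger you already use.

```python
# Coarse part-of-speech tags treated as "switchable" (nouns, verbs, adjectives).
# The tag names are an assumption; adapt them to your tagger's tag set.
SWITCHABLE_POS = {"NOUN", "VERB", "ADJ"}


def is_candidate(tagged_tokens, min_words=12, max_words=40, min_switchable=3):
    """Return True if a sentence passes the length and switchable-word filters.

    tagged_tokens: list of (word, pos) pairs (hypothetical input layout).
    """
    n_words = len(tagged_tokens)
    if not (min_words <= n_words <= max_words):
        return False
    n_switchable = sum(1 for _, pos in tagged_tokens if pos in SWITCHABLE_POS)
    return n_switchable >= min_switchable
```

The thresholds are keyword arguments so the same check can be reused if you want a stricter or looser corpus.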
The final dataset contains:
259 Spanish-English records
298 French-English
188 Canadian French-English
173 German-English
They also released their benchmark and data through their AU-Harness to evaluate voice models.
Dataset, metrics and models evaluated
They measured three key dimensions:
WER (Word Error Rate): word-level accuracy.
SWER (Semantic WER): rate of errors that affect meaning, based on an LLM judge (Gemma-4-31B) following Pipecat's implementation.
AER (Answer Error Rate): a functional metric that generates three questions per utterance and checks whether an LLM can answer them from the transcription (methodology inspired by Bhushan et al.).
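Of the three metrics, WER is the only one that needs no LLM, so it is easy to reproduce. The sketch below computes it as word-level edit distance divided by reference length. The normalization (lowercasing and whitespace splitting) is an assumption; the benchmark's exact text normalization is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, and that a transcript translated into another language scores close to 1.0 even if the meaning is preserved. That is one reason to read it next to SWER and AER.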
Models evaluated (7):
AssemblyAI / Universal 3-Pro
Deepgram / Nova 3 Multilang
ElevenLabs / Scribe V2
Google / Gemini 3 Flash
Mistral AI / Voxtral Small 24B-2507
Nvidia / Parakeet TDT 0.6b V3
OpenAI / Whisper Large V3 Turbo
Main results
The leaders in transcription accuracy were ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro, very close; Scribe holds a slight overall edge.
Gemini 3 Flash stands out on the semantic metrics (SWER, AER), likely because large audio language models (LALMs) are optimized for comprehension and reasoning.
In the mid-table are Deepgram, Mistral and Nvidia, each improving on at least a couple of languages. Parakeet closes the middle group, performing best on German-English.
Whisper Large V3 Turbo lands at the bottom for a clear reason: without an explicit language parameter it tends to translate into English rather than transcribe, which hurts its WER.
Short conclusion: Scribe V2, Gemini 3 Flash and AssemblyAI are the best at handling code-switching in this benchmark, both in accuracy and in preserving meaning.
What's the 'cost' of code-switching?
To measure it, they rendered each utterance as three audio versions: code-switched, monolingual in the matrix language (L2) and monolingual in English. The WER difference between the code-switched version and a monolingual one gives the cost of switching.
The top models (Scribe, Gemini, AssemblyAI) show small deltas; Scribe even beats its own L2 baseline in some cases.
Less robust models degrade more, suggesting code-switching amplifies robustness differences between models instead of creating a uniform difficulty.
Whisper shows the largest degradation relative to English (up to +0.85 in German-English), and is the only one that sometimes does better on code-switched audio than on monolingual L2 because of its habit of translating.
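The switching cost itself is a plain difference of WERs, as sketched below. The dictionary layout and key names (`cs`, `matrix`, `english`) are hypothetical, chosen only for illustration.

```python
def switching_cost(wer_code_switched, wer_monolingual):
    """Positive values mean the code-switched audio was transcribed worse."""
    return wer_code_switched - wer_monolingual


def costs_by_model(results):
    """Compute both deltas per model.

    results: {model: {"cs": wer, "matrix": wer, "english": wer}}
    (hypothetical layout, one WER per audio version).
    """
    return {
        model: {
            "vs_matrix": switching_cost(w["cs"], w["matrix"]),
            "vs_english": switching_cost(w["cs"], w["english"]),
        }
        for model, w in results.items()
    }
```

A negative `vs_matrix` value corresponds to the case described above where a model does better on the code-switched audio than on its own L2 baseline.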
Technical analysis: factors that predict errors
They applied a two-part model: first a logistic regression to identify which variables increase the probability of having at least one error; then a conditional OLS to see which factors affect the magnitude of the error if one already occurred.
Predictors used:
Number of language switches in the utterance
CMI (Code-Mixing Index): proportion of words in the secondary language
Utterance length (control)
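The first two predictors can be derived from per-token language labels. The sketch below follows the simplified CMI definition given above (share of words outside the matrix language); the function names and the label format are assumptions.

```python
def count_switches(langs):
    """Number of positions where the language label differs from the previous token."""
    return sum(1 for prev, curr in zip(langs, langs[1:]) if prev != curr)


def code_mixing_index(langs, matrix_lang):
    """Proportion of tokens that are not in the matrix language."""
    if not langs:
        raise ValueError("need at least one token")
    return sum(1 for lang in langs if lang != matrix_lang) / len(langs)


# Example: a six-token Spanish-matrix utterance with three English tokens.
labels = ["es", "es", "en", "en", "es", "en"]
n_switches = count_switches(labels)       # 3 switches: es->en, en->es, es->en
cmi = code_mixing_index(labels, "es")     # 0.5
```

The two quantities are related but not identical: one long English insertion gives a high CMI with few switches, while many single-word insertions give many switches at a moderate CMI.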
Findings:
The count of switches is the most consistent predictor of whether an error occurs. Each switch introduces a new chance to fail, an effect that is clearest in French-English.
Once an error happens, the magnitude of the error relates more to CMI: the more densely mixed the utterance, the higher the observed WERs (notable in German-English).
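One way to write down a two-part model of this kind is shown below. The exact specification is an assumption: the outcome is taken to be the per-utterance WER, the three predictors enter linearly, and the coefficient names are illustrative. Here σ is the logistic function, the first equation is fit on all utterances and the second only on those with at least one error.

```latex
\begin{align}
\Pr(\mathrm{WER}_i > 0) &= \sigma\!\left(\beta_0 + \beta_1\,\mathrm{switches}_i + \beta_2\,\mathrm{CMI}_i + \beta_3\,\mathrm{length}_i\right) \\
\mathbb{E}\!\left[\mathrm{WER}_i \mid \mathrm{WER}_i > 0\right] &= \gamma_0 + \gamma_1\,\mathrm{switches}_i + \gamma_2\,\mathrm{CMI}_i + \gamma_3\,\mathrm{length}_i
\end{align}
```

Read this way, the findings say that the switch coefficient matters most in the first equation and the CMI coefficient matters most in the second.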
They also analyzed where errors occur at the token level using GPT-5 to label language per token. The surprising result:
Errors concentrate in the English portions of the utterance, not in the matrix language. That's counterintuitive because English is usually the best-handled language by models in monolingual settings.
Possible explanations:
The English portions contain more technical vocabulary and proper names that are hard.
Or inserting a stretch in another language creates an acoustic/linguistic context that makes mid-utterance adaptation harder for the model.
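A token-level analysis like this one can be approximated by aligning the reference with the hypothesis and charging each substitution or deletion to the language of the reference token. The sketch below is an assumption about how such an attribution could work, not the authors' implementation. It skips insertions, in line with the per-language WER limitation noted in the next section, and it expects the language labels as input rather than calling an LLM.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein alignment as a list of (op, ref_index).

    op is 'ok', 'sub', 'del' or 'ins'; ref_index is None for insertions.
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", None))
            j -= 1
    return ops[::-1]


def per_language_error_rate(ref, ref_langs, hyp):
    """Substitutions and deletions per language, divided by reference tokens in that language."""
    errors, totals = Counter(), Counter(ref_langs)
    for op, idx in align(ref, hyp):
        if op in ("sub", "del"):
            errors[ref_langs[idx]] += 1
    return {lang: errors[lang] / totals[lang] for lang in totals}
```

For example, with the reference "necesito reset my password hoy" labeled `es en en en es` and the hypothesis "necesito resetear mi password hoy", both errors fall on English tokens, so the English rate is 2/3 and the Spanish rate is 0.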
Limitations and practical recommendations
Important limitations:
The benchmark is synthetic: audio generated by TTS. It doesn't necessarily capture prosody, accents and phonetic variation of real speakers.
All tests ran in automatic language detection mode. Many systems allow forcing language or using hints that could improve production results.
Per-language WER excludes insertions because attributing language to insertions is hard.
Recommendations if you run bilingual voice agents:
Don't trust general rankings; benchmark with your customers' languages. The best model for Spanish-English won't necessarily be best for German-English.
Test with both synthetic audio and real samples from your users to capture accents and prosody.
Consider using LALMs or models with strong semantic ability if your flow depends on understanding (forms, routing, entity extraction).
Experiment with configuration options: forced language tokens, multilingual hints and pipelines that detect switches and adapt the model.
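As one example of a forced-language configuration, the sketch below passes an explicit language and task to Whisper. It assumes the open-source `openai-whisper` Python package and its `large-v3-turbo` checkpoint name; the benchmark's own harness settings are not described here, and whether the hint helps on code-switched audio is something to measure, not assume.

```python
def whisper_decode_options(language_code):
    """Decoding options that pin the language and request transcription, not translation."""
    return {"language": language_code, "task": "transcribe"}


def transcribe_with_language_hint(audio_path, language_code, model_name="large-v3-turbo"):
    """Transcribe one file with an explicit matrix-language hint.

    Requires the third-party `openai-whisper` package; it is imported lazily
    so the helper above stays usable without it.
    """
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **whisper_decode_options(language_code))
    return result["text"]
```

Running the same audio with and without the hint, and comparing WER on both, gives a direct estimate of how much of the degradation comes from automatic language detection.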
Final reflection
Code-switching is no longer an exotic failure mode; it is becoming a normal condition that top ASR systems can handle with small penalties. Does that mean you can deploy a bilingual assistant without more checks? Almost, but not before validating with your specific languages and scenarios. The technical takeaway is clear: measure both accuracy and preservation of meaning, because a reasonable WER doesn't guarantee that critical information (case numbers, requests, names) reaches your systems intact.