Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
By Jakub Antkiewicz
2026-06-10T11:40:12Z
ServiceNow Benchmark Reveals Top ASR Models for Bilingual Enterprise Use
Researchers at ServiceNow have released a new benchmark evaluating how well modern Automatic Speech Recognition (ASR) systems handle code-switching, the common practice of mixing languages mid-sentence. The study addresses a critical gap for enterprises serving bilingual customers, where transcription errors in voice agents for IT or HR support can lead to significant operational failures. The findings identify ElevenLabs Scribe V2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro as the most capable models for transcribing these complex, mixed-language interactions.
The benchmark evaluated seven ASR systems across four language pairs (Spanish-English, French-English, Canadian French-English, and German-English) using a custom dataset of enterprise scenarios. Performance was assessed with three metrics: Word Error Rate (WER) for raw accuracy, Semantic Word Error Rate (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream task comprehension. The results showed a clear performance hierarchy, with some models incurring a much larger penalty than others on code-switched audio relative to monolingual speech. Notably, OpenAI's Whisper Large V3 Turbo struggled, often translating the audio into English rather than transcribing it, which resulted in high error rates.
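To make the first metric concrete, here is a minimal Python sketch of WER as it is conventionally defined: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. This is not the benchmark's own scoring code. The function names, the lowercase-and-split normalization, and the `code_switch_penalty` helper (a plain difference between two WER values) are illustrative assumptions. SWER and AER are not sketched because the article does not describe how they are computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level Levenshtein distance / reference length.

    Normalization here (lowercasing, whitespace split) is an assumption;
    a real benchmark may normalize punctuation and numerals differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def code_switch_penalty(wer_code_switched: float, wer_monolingual: float) -> float:
    """Hypothetical helper: absolute WER increase on code-switched audio."""
    return wer_code_switched - wer_monolingual


if __name__ == "__main__":
    # Made-up example utterance, not taken from the benchmark dataset.
    reference = "necesito resetear mi password por favor"
    hypothesis = "necesito resetear mi contraseña por favor"
    print(word_error_rate(reference, hypothesis))
```

In this made-up example the model replaces the English word "password" with its Spanish translation, so one of six reference words is wrong and the WER is 1/6. A model that translates the whole utterance into English would mismatch nearly every reference word, which matches the failure mode reported for Whisper Large V3 Turbo.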
This research provides a critical framework for enterprises selecting voice AI vendors. It demonstrates that robustness to code-switching is a key differentiator, not a universal capability among leading ASR models. The study also found that while raw transcription accuracy is important, models with strong underlying language understanding can sometimes outperform others on functional tasks, even with slightly higher word-level errors. This distinction is vital for businesses where the semantic accuracy of a support ticket or customer query directly impacts resolution time and operational cost.
Key Model Performance
- ElevenLabs Scribe V2: Emerged as the top performer, leading across most metrics and language pairs with the lowest error rates.
- Google Gemini 3 Flash: Showed exceptional semantic performance, consistently outperforming AssemblyAI on the Answer Error Rate (AER) metric, making it highly effective for downstream tasks.
- AssemblyAI Universal 3-Pro: Delivered excellent transcription accuracy (WER), ranking a close second to Scribe V2 and establishing itself as a top-tier option for precision.
- OpenAI Whisper Large V3 Turbo: Consistently ranked last, as its tendency to translate rather than transcribe code-switched audio proved to be a major limitation for this use case.
Strategic Takeaway: Enterprise ASR procurement should treat code-switching as a core competency, not an edge case. This benchmark shows that WER alone is an incomplete metric; models like Google's Gemini 3 Flash demonstrate that superior semantic understanding can compensate for minor transcription errors, which directly affects downstream task success and operational efficiency.