Gemini 3.1 Flash TTS: the next generation of expressive AI speech
By Jakub Antkiewicz
2026-04-16T09:18:39Z
Google Releases Gemini 3.1 Flash TTS with Director-Level Audio Controls
Google has rolled out Gemini 3.1 Flash TTS, its latest text-to-speech model, now in preview across the Gemini API, Google AI Studio, Vertex AI, and Google Vids. The release gives developers finer control over AI-generated speech through a new feature called “audio tags.” The feature moves text-to-speech beyond generating natural-sounding audio and toward directed vocal performance, allowing nuanced expression in applications ranging from creative content to enterprise communication tools.
The core technical advance is audio tags: natural language commands embedded directly in the input text to steer vocal style, pace, and delivery. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211, and it is positioned favorably there for its blend of high quality and low cost. The model supports over 70 languages and native multi-speaker dialogue. For responsible deployment, all audio it generates is watermarked with SynthID to help identify AI-generated content.
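To make the idea of tags embedded in text concrete, here is a minimal Python sketch that assembles a multi-speaker script with inline delivery directions. It calls no Google API. The square-bracket tag syntax, the "Speaker: text" layout, and the names `Line`, `render_line`, and `render_script` are assumptions made for illustration; the article says only that audio tags are natural language commands placed in the text.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of dialogue plus optional delivery directions."""
    speaker: str
    text: str
    tags: tuple[str, ...] = ()


def render_line(line: Line) -> str:
    # Assumed syntax: each direction becomes a bracketed tag placed
    # before the spoken text. The real tag format may differ.
    prefix = "".join(f"[{tag}] " for tag in line.tags)
    return f"{line.speaker}: {prefix}{line.text}"


def render_script(lines: list[Line]) -> str:
    # One rendered line per speaker turn, in order.
    return "\n".join(render_line(line) for line in lines)


script = render_script([
    Line("Host", "Welcome back to the show.", ("warm", "slow")),
    Line("Guest", "Thanks, it is great to be here!", ("excited",)),
    Line("Host", "Let's get started."),
])
print(script)
```

Keeping the directions as structured data rather than hand-typed brackets lets the same script be re-rendered if the tag syntax turns out to differ. The resulting string would then be passed as the text input to whichever TTS endpoint is in use.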
- Platform Availability: Preview on Gemini API, Google AI Studio, Vertex AI, and Google Vids.
- Key Feature: 'Audio tags' for granular control over pace, tone, and delivery.
- Developer Tools: AI Studio offers scene direction, speaker-specific profiles, and API code export for consistent voice performance.
- Performance: 1,211 Elo score on the Artificial Analysis TTS leaderboard.
- Safety: All audio output includes an imperceptible SynthID watermark.
Gemini 3.1 Flash TTS directly challenges specialized voice AI companies by building customizable text-to-speech into Google's broader developer ecosystem. With enterprise-ready tools in Vertex AI and intuitive controls for developers, Google is positioning expressive AI speech not as a standalone product but as an integrated feature for building more engaging applications. This could accelerate the adoption of sophisticated, brand-specific AI voices in customer service bots, accessibility tools, and dynamic content creation platforms, shifting competition from raw vocal quality to the ease and depth of creative control.
Google's introduction of “audio tags” in Gemini 3.1 Flash TTS shifts the text-to-speech market's focus from natural-sounding voice generation to nuanced vocal direction, commoditizing high-fidelity AI audio while giving developers director-level control over performance.