Gemini 3.1 Flash Live: Google's Realtime Voice AI With 90+ Languages [2026]

Google's Gemini 3.1 Flash Live brings realtime voice AI with 90+ languages and multimodal support. What it means for voice agents and AI phone assistants.

David Schemm

Google released Gemini 3.1 Flash Live on March 26, 2026. It’s an audio-to-audio model built for realtime voice conversations, and it ships with support for over 90 languages. That’s the widest language coverage of any voice AI model available today.

The “Live” part matters. This isn’t a text model that happens to accept audio input. Flash Live processes speech directly as audio and generates spoken responses without the usual speech-to-text-to-LLM-to-text-to-speech pipeline. The result is lower latency and better preservation of vocal nuance, things like tone, emphasis, and pacing that get lost when you convert speech to text and back.

For anyone building or using voice AI agents, this is a significant release. Here’s what it does and why it matters.

Gemini 3.1 Flash Live: Google’s Realtime Voice AI for 90+ Languages

Flash Live sits within Google’s Gemini model family. Where earlier Gemini models focused on text, images, and code, Flash Live is optimized specifically for spoken dialogue. Google calls it their “highest-quality audio model” and has integrated it into both consumer products (Gemini Live, Search Live) and developer tools (Gemini Live API in Google AI Studio).

At launch, the model was available in over 200 countries through Google's consumer apps. For developers, access comes through the Gemini Live API, which lets teams build custom voice applications on top of the model.

Why does the “realtime” part deserve attention? Traditional voice AI systems chain multiple models together. Speech recognition converts audio to text. A language model generates a text response. A text-to-speech engine converts that back to audio. Each step adds latency, and each conversion loses information. Flash Live collapses some of these steps by working directly with audio, similar to what OpenAI did with their GPT Realtime API.
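The latency argument comes down to simple arithmetic: sequential stages add up, while a single audio-native model pays one delay. A minimal sketch, with hypothetical stage timings chosen for illustration only (no published figures for any specific model):

```python
# Illustrative only: all latency figures below are made-up assumptions,
# not measurements of any real model or pipeline.

def pipeline_latency(stage_ms: dict[str, float]) -> float:
    """Total added latency when stages run strictly one after another."""
    return sum(stage_ms.values())

# Classic chained pipeline: each conversion step adds its own delay.
chained = pipeline_latency({
    "speech_to_text": 300.0,   # transcribe caller audio
    "llm_response": 500.0,     # generate a text reply
    "text_to_speech": 250.0,   # synthesize the reply as audio
})

# Audio-native model: one model consumes and emits audio directly.
audio_native = pipeline_latency({"audio_to_audio": 600.0})

print(f"chained: {chained:.0f} ms, audio-native: {audio_native:.0f} ms")
```

The numbers are placeholders, but the structure of the comparison holds: removing conversion stages removes both their latency and the information (tone, emphasis, pacing) lost at each boundary.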

The direction is clear across the industry: audio-native models are replacing the chain-of-models approach for voice applications. Google is making a big bet on this with Flash Live.

90+ Languages: The Widest Language Support in Voice AI

Ninety languages. To put that in context, OpenAI’s voice mode supports roughly 50 languages. Safina supports 50+ languages for phone calls. ElevenLabs covers around 30 for conversational AI. Flash Live’s 90+ figure is the largest language set any single voice model has shipped with.

For businesses with global reach, more languages from one model means simpler infrastructure. Instead of routing calls through different models depending on the caller’s language, a single model handles the detection and response. Flash Live includes automatic language detection and can switch languages mid-conversation, which matters for bilingual callers or regions where code-switching is common.
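The infrastructure simplification can be sketched in a few lines. Everything here is a hypothetical stand-in (the model names and routing functions are invented for illustration), but it shows why per-language routing disappears when one model handles detection itself:

```python
# Hypothetical sketch: model names and routing logic are invented
# stand-ins, not a real API.

PER_LANGUAGE_MODELS = {"en": "voice-model-en", "de": "voice-model-de"}

def route_per_language(detected_language: str) -> str:
    """Legacy approach: pick a model per detected language; any
    unsupported language falls back to a default."""
    return PER_LANGUAGE_MODELS.get(detected_language, "voice-model-en")

def route_single_model(detected_language: str) -> str:
    """Audio-native approach: one model detects the caller's language
    and responds in it, including mid-call switches."""
    return "flash-live"

# A Swahili caller hits the fallback in the legacy setup, but the
# single-model setup needs no special case.
print(route_per_language("sw"), route_single_model("sw"))
```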

The question is quality versus quantity. Supporting a language at a basic level is different from handling it well enough for a business phone call. Accent variation, dialect differences, and domain-specific vocabulary all affect how useful a model is in practice. A model that handles 90 languages at 80% quality faces different trade-offs than one handling 50 languages at 95% quality.

Google has not published detailed per-language benchmarks for Flash Live. The 90+ figure covers languages available in Gemini Live consumer mode, where conversation is more forgiving than, say, capturing a caller’s address or appointment details on a phone line. For products like Safina that handle real business calls, accuracy on names, numbers, and specific requests is what matters most.

Multimodal Conversations: Voice Plus Screen

One feature that sets Flash Live apart from phone-focused voice models: it can process visual input during a conversation. If you’re using Gemini Live on a phone or laptop, the model can see your screen or webcam feed while talking to you.

This opens up use cases that pure audio models can’t touch. Walking someone through a software interface while they share their screen. Discussing a document that both parties can see. Helping a user troubleshoot hardware by looking at it through the camera.

For phone calls, though, none of this applies. Phone calls are audio-only. A caller dialing a business number won’t be sharing their screen. The multimodal capability is interesting as a technology signal (voice plus vision is where consumer AI is heading), but it doesn’t change the phone assistant equation.

What does carry over is the acoustic understanding. Flash Live detects “acoustic nuance,” which Google describes as the ability to pick up on tone, emotion, and emphasis in the caller’s voice. That matters on the phone. Knowing whether a caller sounds frustrated versus relaxed changes how a good assistant should respond. This is a capability that Cartesia’s Sonic 3 approaches from the output side (generating emotional speech), while Flash Live approaches it from the input side (understanding emotional speech).

Function Calling and Agent Capabilities

Flash Live scored 90.8% on ComplexFuncBench Audio, a benchmark that tests whether a voice model can correctly call functions based on spoken instructions. That’s the highest score in the field as of March 2026.

Function calling is what turns a voice model from a chatbot into an agent. Instead of just generating spoken answers, the model can take actions: check a calendar, look up an order, book an appointment, transfer a call. The caller says “move my Thursday appointment to Friday afternoon” and the model doesn’t just confirm it heard the request. It calls the scheduling API and makes the change.
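On the application side, function calling means the model emits a structured call and your code executes it. A minimal sketch, assuming a JSON-Schema-style function declaration; the `move_appointment` function, the declaration, and the model's emitted call shown here are all hypothetical examples, not the Gemini Live API's actual wire format:

```python
# Hypothetical sketch of application-side tool dispatch. The declaration
# mirrors common JSON-Schema-style function declarations; the handler and
# the model's structured call are invented for illustration.

MOVE_APPOINTMENT_DECLARATION = {
    "name": "move_appointment",
    "description": "Reschedule an existing appointment.",
    "parameters": {
        "type": "object",
        "properties": {
            "from_day": {"type": "string"},
            "to_day": {"type": "string"},
            "to_time": {"type": "string"},
        },
        "required": ["from_day", "to_day"],
    },
}

def move_appointment(from_day: str, to_day: str, to_time: str = "") -> str:
    # Real code would call the scheduling backend here.
    suffix = f" ({to_time})" if to_time else ""
    return f"Moved {from_day} appointment to {to_day}{suffix}."

HANDLERS = {"move_appointment": move_appointment}

def dispatch(tool_call: dict) -> str:
    """Route a structured call emitted by the model to local code."""
    return HANDLERS[tool_call["name"]](**tool_call["args"])

# What the model might emit for "move my Thursday appointment to
# Friday afternoon":
call = {"name": "move_appointment",
        "args": {"from_day": "Thursday", "to_day": "Friday",
                 "to_time": "afternoon"}}
print(dispatch(call))
```

The hard part, and what the ComplexFuncBench Audio score measures, is the model's side of this loop: mapping messy spoken requests onto the right declaration with the right arguments.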

For AI phone assistants, this is the most relevant capability. Phone calls are conversations with a purpose. The caller wants something done, not just discussed. A model that excels at understanding spoken requests and mapping them to the right function call is exactly what phone agents need.

Google is giving developers access through the Gemini Live API in Google AI Studio. The API supports function declarations, so developers can define what actions the model can take and the model handles the mapping from natural speech to structured function calls.

This is an area where the entire voice agent ecosystem is competing hard. OpenAI’s Realtime API has function calling. Anthropic’s Claude models support tool use. Google’s 90.8% benchmark score suggests they’re currently ahead on the specific challenge of function calling from spoken audio.

What This Means for AI Phone Assistants

Flash Live pushes the field forward in three areas that matter for phone AI.

Language coverage sets a new bar. Ninety languages forces every other voice AI provider to respond. For businesses that operate across borders or serve multilingual populations, the gap between 30 languages and 90 languages is the gap between “we cover our main markets” and “we cover everywhere.” As the benchmark rises, dedicated phone products need to keep expanding language support to stay competitive.

Audio-native models are becoming the standard. The traditional pipeline of STT, LLM, and TTS is being replaced by models that work directly with audio. Flash Live, OpenAI’s Realtime API, and others are all moving in this direction. Products built on the old pipeline will feel the latency gap. Safina’s architecture already prioritizes low latency, but the underlying model technology is shifting.

General-purpose versus phone-specific remains the divide. Flash Live is designed for broad conversational AI, with screen sharing, webcam integration, and consumer chat. That’s different from what a phone assistant needs: reliable call handling, accurate information capture, CRM integration, caller greeting by name, appointment booking, and dozens of business-specific workflows. Google is building the engine. Products like Safina build the car.

The companies that win in phone AI won’t just have the best model. They’ll have the best integration between the model and everything around it: telephony, business data, caller context, and follow-up actions. Flash Live is a powerful engine, and it raises the performance floor for everyone. The question for businesses is whether they need a general-purpose voice AI or a phone assistant purpose-built for their calls.

For a broader look at the current state of voice AI and how these technologies compare, see our voice agent landscape overview for 2026 and our comparison of AI solutions.

Say goodbye to your old-fashioned voicemail.

Try Safina for free and start managing your calls intelligently.