Most AI voice systems today work like a relay race. Audio comes in, gets transcribed to text, the text goes to a language model, the response gets synthesized back into speech, and the audio goes out. Three separate models, three handoffs, three chances to add latency and lose information.
OpenAI’s GPT-Realtime takes a different approach. A single model processes incoming audio and produces outgoing audio directly. No transcription step. No text-to-speech step. One model, end to end. The Realtime API is now generally available for production use, and the implications for phone-based voice agents are worth examining closely.
What Speech-to-Speech Actually Changes
The traditional voice AI pipeline looks like this: Speech-to-Text (STT) converts the caller’s words into text. A large language model (LLM) reads that text and generates a response. Text-to-Speech (TTS) turns the response back into audio. Each step takes time. STT adds 100-300ms. The LLM adds its own processing time. TTS adds another 100-300ms. Total round-trip latency lands somewhere between 1 and 2 seconds.
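The per-stage numbers above can be turned into a quick latency budget. This is a back-of-envelope sketch, not a measurement of any specific provider; the stage timings are illustrative midpoints of the ranges mentioned in the text.

```python
# Back-of-envelope latency budget for a pipelined voice agent.
# Stage timings are illustrative, not benchmarks of real providers.

PIPELINE_MS = {
    "stt": 200,      # speech-to-text: roughly 100-300ms
    "llm": 700,      # language model: varies with response length
    "tts": 200,      # text-to-speech: roughly 100-300ms
    "network": 150,  # transport overhead between the components
}

def total_latency(stages: dict[str, int]) -> int:
    """Sum per-stage latencies into one round-trip figure."""
    return sum(stages.values())

print(total_latency(PIPELINE_MS))  # 1250 -> squarely in the 1-2 second range
```

Even with generous assumptions for each stage, the handoffs stack up fast, which is why pipeline systems land above a second unless every component is aggressively optimized.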
But latency is only half the problem. Each handoff loses information. When audio gets transcribed, the tone is gone. The hesitation in someone’s voice. The frustration. The relief. The transcription says “okay” whether the caller said it with enthusiasm or resignation. The LLM responds to the word, not the feeling. And TTS generates a response in whatever voice profile it was given, disconnected from the emotional context of the conversation.
GPT-Realtime processes the audio signal directly. The model hears the caller’s tone, pace, and emotion, and generates a response that accounts for all of it. The output audio carries its own appropriate tone. OpenAI reports end-to-end latency of 250-500ms, which puts responses inside the window where conversations feel natural.
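In practice, an application talks to the model over a WebSocket by exchanging JSON events. The sketch below builds two of the core client events without opening a connection. The event type names (`session.update`, `input_audio_buffer.append`) follow OpenAI's Realtime API, but the exact session fields change between API versions, so treat the payload shapes as illustrative and check the current reference before relying on them.

```python
import json

# Sketch of client-side Realtime events. Event names follow OpenAI's
# Realtime API; session field names are illustrative and may differ
# between API versions.

def session_update(voice: str, instructions: str) -> str:
    """Configure the session once the WebSocket is open."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,                # e.g. "marin" or "cedar"
            "instructions": instructions,  # system-style prompt for the agent
        },
    })

def append_audio(base64_pcm: str) -> str:
    """Stream a chunk of caller audio into the model's input buffer."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64_pcm,  # base64-encoded audio bytes
    })

event = json.loads(session_update("marin", "You answer the phone for a dental office."))
print(event["type"])  # session.update
```

The important point is what is absent: there is no transcription payload anywhere. Audio goes in, and the model's audio response streams back over the same socket.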
Anyone who has listened to AI phone calls knows the difference between a 500ms pause and a 1.5-second pause. The first feels like a thoughtful person. The second feels like talking to a machine.
The Old Way vs. Realtime: Why Architecture Matters
Here’s what happens at each latency level during a phone call:
Under 500ms: The conversation flows. The caller barely notices any delay. It feels like talking to a person who thinks before speaking, which is normal.
500ms to 1 second: Noticeable but tolerable. Callers start to register that something is different. They might slow down their own speech or pause longer between sentences to compensate.
Over 1 second: The conversation breaks down. Callers start talking over the AI. They repeat themselves. They get frustrated. Some hang up.
The traditional pipeline (STT + LLM + TTS) typically lands in the 1-2 second range. Good implementations with optimized models and streaming can push it under a second. GPT-Realtime’s 250-500ms target puts it in the “feels like talking to a person” category.
There’s another advantage beyond raw speed. Because the model processes audio natively, it can pick up on things that transcription misses. A sigh. A laugh. A shift in speaking pace that signals confusion. These signals shape how the model responds, both in content and in tone.
We’ve written about how Safina’s architecture handles this pipeline, including the specific choices around speech-to-text and text-to-speech. The pipeline approach offers its own advantages, which we’ll get to.
SIP Calling: AI on the Phone Network
One of the most practical additions to the Realtime API is SIP support. SIP (Session Initiation Protocol) is the standard that phone networks use to set up and manage calls. Supporting SIP means AI agents built on GPT-Realtime can make and receive actual phone calls through standard telephony infrastructure.
Before SIP support, connecting an AI voice agent to the phone network required middleware. You needed a telephony provider (like Twilio), a WebSocket bridge, and custom code to route audio between the phone network and the AI. It worked, but it added complexity, cost, and latency.
With native SIP support, the AI agent plugs directly into the phone system. Businesses can assign phone numbers, set up call routing, and handle inbound or outbound calls without building a telephony layer from scratch. For companies that want to automate phone interactions, this removes a significant engineering burden.
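Once the agent has a SIP address, call routing becomes ordinary application logic. The sketch below shows the idea; the SIP URIs and project identifier are hypothetical placeholders, since the real values come from your telephony provider or API dashboard.

```python
# Minimal sketch of inbound-call routing to an AI agent's SIP address.
# Both SIP URIs below are hypothetical placeholders, not real endpoints.

AGENT_SIP_URI = "sip:proj_example@sip.example-provider.com"  # placeholder
VOICEMAIL_URI = "sip:voicemail@pbx.example.com"              # placeholder

AGENT_NUMBERS = {"+15551230001", "+15551230002"}

def route_inbound(called_number: str, within_hours: bool) -> str:
    """Send business-hours calls on agent numbers to the AI; else voicemail."""
    if called_number in AGENT_NUMBERS and within_hours:
        return AGENT_SIP_URI
    return VOICEMAIL_URI

print(route_inbound("+15551230001", within_hours=True))
```

This is the layer that previously required a WebSocket bridge and custom audio plumbing; with native SIP support, the routing decision is all that remains on the business side.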
Benchmark Improvements
The latest GPT-Realtime model shows measurable gains over the December 2024 version across three areas that matter for phone applications:
Intelligence (BigBench Audio): 65.6% to 82.8%. The model understands what callers are saying and asking with higher accuracy.
Instruction Following (MultiChallenge Audio): 20.6% to 30.5%. When given specific instructions about how to handle calls (ask for a name, confirm an appointment, collect information), the model follows them more reliably.
Function Calling (ComplexFuncBench Audio): 49.7% to 66.5%. The model can trigger external actions (booking appointments, looking up records, sending notifications) based on the conversation.
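Function calling is where these benchmarks meet application code. Below is one way a phone agent might declare an appointment-booking tool and parse the model's arguments. The function name and parameters are made up for illustration; the overall shape (a name, a description, and a JSON Schema for parameters) is the standard way tools are declared to OpenAI models.

```python
import json

# Hypothetical tool declaration for a phone agent. The "book_appointment"
# name and its parameters are invented for illustration; the schema shape
# follows the standard tool-declaration pattern for OpenAI models.

BOOK_APPOINTMENT_TOOL = {
    "type": "function",
    "name": "book_appointment",
    "description": "Book an appointment slot for the caller.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Caller's full name"},
            "date": {"type": "string", "description": "ISO date, e.g. 2026-03-14"},
            "time": {"type": "string", "description": "24h time, e.g. 15:30"},
        },
        "required": ["name", "date", "time"],
    },
}

def handle_tool_call(arguments_json: str) -> dict:
    """Parse the model's JSON argument string into a booking request."""
    args = json.loads(arguments_json)
    return {k: args[k] for k in ("name", "date", "time")}

print(handle_tool_call('{"name": "Ada", "date": "2026-03-14", "time": "15:30"}'))
```

The model decides mid-conversation when to emit a call to this tool; the application executes the real booking and reports the result back so the agent can confirm it to the caller in speech.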
OpenAI also introduced two new voices, Cedar and Marin, exclusive to the Realtime API. And there’s a cost-optimized variant called gpt-realtime-mini for applications where lower latency and reduced cost matter more than maximum capability.
One known limitation: the model sometimes misidentifies the language of speakers with heavy accents. For businesses serving diverse caller populations, this is worth testing.
Trade-offs: Realtime vs. Pipeline for Phone Assistants
GPT-Realtime is impressive, but it’s not the only valid approach to voice AI. The pipeline architecture (STT + LLM + TTS) that products like Safina use has real advantages.
Control over each component. In a pipeline, you can swap out any piece. Better STT model? Drop it in. New TTS voice you prefer? Switch it. Want to use a different LLM for certain types of calls? Route accordingly. With a single end-to-end model, you get what the model gives you.
Transparency. In a pipeline, you can inspect what happened at each stage. You can see the transcription, read the LLM’s reasoning, and evaluate the TTS output independently. With a speech-to-speech model, the intermediate steps are hidden inside the model. Debugging is harder.
Provider independence. A pipeline lets you mix providers. Use Deepgram for STT, Claude for reasoning, Cartesia for TTS. If any provider has an outage or raises prices, you swap that one piece. With an end-to-end model, you’re locked into a single provider for the entire voice experience.
Optimization per step. Each component in a pipeline can be individually optimized. You can use a faster STT model for simple queries and a more accurate one for complex ones. You can adjust TTS parameters based on the emotional context that the LLM identifies. This granular control is harder to achieve with a single model.
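The "swap any piece" property above comes from putting each stage behind a small interface. The sketch below uses stand-in components rather than real provider SDKs, but the structure is the point: any stage can be replaced without touching the others.

```python
from typing import Protocol

# Sketch of a pluggable STT -> LLM -> TTS pipeline. The concrete classes
# are stand-ins for illustration, not real provider SDKs.

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()    # stand-in: treat the "audio" as text

class UppercaseLLM:
    def respond(self, text: str) -> str:
        return text.upper()      # stand-in "reasoning"

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()     # stand-in synthesis

def run_pipeline(audio: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """Each stage is injected, so any one can be swapped independently."""
    return tts.synthesize(llm.respond(stt.transcribe(audio)))

print(run_pipeline(b"hello caller", EchoSTT(), UppercaseLLM(), BytesTTS()))
# b'HELLO CALLER'
```

Replacing `EchoSTT` with a Deepgram-backed class, or `UppercaseLLM` with a Claude-backed one, changes a single constructor argument. An end-to-end model collapses all three interfaces into one, which is precisely the trade-off described above.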
The realtime approach wins on latency and emotional continuity. The pipeline approach wins on flexibility and control. Both are valid. The industry is likely heading toward some combination of the two, with end-to-end models handling the fast path and pipeline components available for specialized needs.
For a broader view of how different companies are approaching voice AI, see our voice agent landscape overview for 2026. And for more on how voice quality affects caller perception, our piece on the psychology behind AI voice covers the research.