In March 2025, Cartesia shipped Sonic 2 and announced a $64 million Series A. For most people, that was a funding headline. For anyone building voice AI products, it was the moment phone calls started sounding different.
Sonic 2 is a text-to-speech model: it turns text into spoken audio. That description undersells what changed. Before Sonic 2, the best TTS options for phone conversations forced a tradeoff: you could have a natural voice or a fast response, but rarely both at once without a steep compromise on one side.
Cartesia built a model that does both. And it matters more for phone calls than for any other use case.
Cartesia Sonic 2: The TTS Model That Changed How AI Sounds on the Phone
Cartesia is a San Francisco-based company that builds real-time AI models. Their TTS line, called Sonic, has focused on low-latency voice generation since its first version. Sonic 2 launched in March 2025 as a major overhaul, not an incremental update.
The technical architecture is different from what most TTS providers use. ElevenLabs, OpenAI, and Play.HT all rely on transformer-based architectures, the same family of models behind GPT and similar language models. Cartesia went a different route with state space models (SSMs). These handle sequential data, like audio, differently than transformers. Instead of processing the entire sequence at once and attending to every previous token, SSMs maintain a compressed state that gets updated with each new input.
The practical result: faster generation with less compute. Transformers scale poorly for long sequences because attention cost grows quadratically with sequence length. SSMs avoid that bottleneck, which is why Cartesia can produce speech at the speeds they do.
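The difference can be sketched in a few lines. A linear state space model keeps a fixed-size hidden state and updates it once per input step, so the cost per step is constant no matter how long the sequence gets. This is a toy illustration of the recurrence, not Cartesia's actual architecture; the matrices A, B, C are placeholder parameters:

```python
import numpy as np

# Minimal linear state-space recurrence: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.
# Per-step cost depends only on the state size d, not on how many steps came
# before -- unlike attention, whose per-step cost grows with sequence length.
def ssm_generate(xs, A, B, C):
    d = A.shape[0]
    h = np.zeros(d)            # fixed-size compressed state
    outputs = []
    for x in xs:               # one constant-cost update per input step
        h = A @ h + B @ x
        outputs.append(C @ h)
    return outputs

rng = np.random.default_rng(0)
d, n = 4, 1000
A = 0.9 * np.eye(d)            # stable toy dynamics
B = rng.normal(size=(d, 1))
C = rng.normal(size=(1, d))
ys = ssm_generate(rng.normal(size=(n, 1)), A, B, C)
```

The state `h` is the "compressed state" described above: no matter whether the model has seen ten tokens or ten thousand, each new step touches the same small amount of memory and compute.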
In blind evaluations, Sonic 2 showed a 1.5x preference rate over the next best provider. That number came from Cartesia’s own testing, so take it with appropriate skepticism. But the quality difference is noticeable when you hear it. The voices sound less like TTS and more like a person reading naturally, with the right pacing and emphasis.
Cartesia charges $46.70 per million characters. For a deeper look at how different providers compare on cost and quality, we put together a full comparison guide.
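To put that price in phone-call terms, a quick back-of-envelope calculation helps. The speaking rate here is an assumption (roughly 150 words per minute, about 750 characters per minute); real usage varies with language and pacing:

```python
# Back-of-envelope TTS cost per minute of generated speech.
PRICE_PER_MILLION_CHARS = 46.70   # USD, Cartesia's listed rate
CHARS_PER_MINUTE = 750            # assumed conversational speaking rate

cost_per_minute = PRICE_PER_MILLION_CHARS * CHARS_PER_MINUTE / 1_000_000
print(f"${cost_per_minute:.4f} per spoken minute")  # roughly $0.035
```

At about three and a half cents per spoken minute, TTS is usually a small line item next to LLM inference and telephony costs in a voice AI stack.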
Why 90 Milliseconds Changes Everything
Here is where the phone call context becomes critical.
When you are reading a blog post, the TTS latency does not matter. An audiobook can take a full second to start playing and nobody cares. But on a phone call, every millisecond of silence after someone stops talking creates awkwardness. Humans are sensitive to conversational pauses. Research on turn-taking in conversation shows that gaps longer than about 200 milliseconds start feeling unnatural.
Traditional TTS systems run between 200 and 500 milliseconds of latency. That is the time from receiving the text to producing the first audio bytes. At 300ms of TTS latency, combined with speech recognition time and LLM processing, the total delay in an AI phone call easily reaches 800ms to a full second. Callers notice. They start repeating themselves, talking over the AI, or hanging up.
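The arithmetic is easy to check. The stage numbers below are illustrative assumptions, not measurements from any specific system, but they show how a 300ms TTS pushes the round trip near a second while a 90ms TTS does not:

```python
# Illustrative AI phone-call latency budget. Stage values are assumptions:
# speech recognition, LLM time-to-first-token, TTS time-to-first-audio,
# and network/telephony overhead, all in milliseconds.
def total_latency_ms(asr=250, llm=300, tts=300, network=100):
    return asr + llm + tts + network

print(total_latency_ms(tts=300))  # 950 ms with a traditional TTS stage
print(total_latency_ms(tts=90))   # 740 ms with a sub-100ms TTS stage
```

Cutting the TTS stage only shaves the budget by its own share, of course, but when every other stage is already squeezed, that share is the difference between a pause callers notice and one they do not.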
Sonic 2 brings model latency down to 90 milliseconds. The Sonic Turbo variant hits 40ms. At sub-100ms TTS latency, the TTS stage nearly disappears from the total delay budget. The caller hears a response coming back at a pace that feels normal.
This is not about specs on a benchmark. It is about whether a caller feels like they are talking to a responsive system or waiting for a machine to catch up. At 90ms, most people stop noticing the TTS delay entirely. The conversation flows the way phone conversations should.
For a technical look at how this fits into the full voice pipeline, including speech recognition and LLM inference, read our breakdown of Safina’s TTS approach.
Voice Cloning and 15 Languages
Sonic 2 launched with support for 15 languages: English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, and Turkish. That is a wide net for a model that maintains quality across all of them.
Voice cloning works from just 3 seconds of audio. You record a short sample, and the model can generate speech in that voice across any of the supported languages. For businesses, this means a company can maintain a consistent brand voice across international markets without recording separate voice talent for each language.
The multilingual angle matters for AI phone assistants specifically. A business in Berlin might field calls in German, English, and Turkish. A company in Miami handles English and Spanish daily. Being able to respond in the caller’s language, with natural pronunciation and the same voice identity, changes how callers experience automated phone systems.
We have written separately about why multilingual support is a big deal for AI phone assistants and the broader voice AI landscape in 2026.
What This Means for AI Phone Assistants
Phone calls are audio-only. There is no visual UI, no chat bubble, no loading spinner. The voice IS the entire product experience. When that voice sounds flat, robotic, or slow, callers lose trust fast. When it sounds natural and responsive, they engage with the system the way they would with a human.
This is why TTS quality is not a nice-to-have for phone assistants. It is the core of whether the product works.
Low latency creates natural conversation flow. Callers ask questions, and the response starts before the pause gets uncomfortable. Good prosody (the rhythm and intonation of speech) means the AI sounds like it understands what it is saying, not just reading words aloud. Voice cloning means a business can have its AI assistant match the warm, professional tone the brand has built over years.
Products like Safina use TTS as the final stage in a pipeline that includes speech recognition, language model processing, and audio generation. Each stage adds latency. When the TTS model can do its part in under 100 milliseconds, the total response time stays within the range that feels conversational.
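Pipelines like this usually stream between stages rather than waiting for each one to finish: the LLM emits tokens, a splitter groups them into sentences, and TTS starts synthesizing the first sentence while the rest is still being generated. The sketch below shows the shape of that chaining with placeholder stage functions; it is not Safina's or Cartesia's actual API:

```python
# Sketch of a streaming voice pipeline: each stage yields partial output so
# the next stage can start before the previous one finishes.
def llm_tokens(prompt):
    # Placeholder for a streaming LLM response.
    yield from "Sure, we are open until six today.".split()

def sentences(tokens):
    # Group streamed tokens into complete sentences for the TTS stage.
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "?", "!")):
            yield " ".join(buf)
            buf = []
    if buf:
        yield " ".join(buf)

def tts_chunks(sents):
    # Stand-in for per-sentence audio synthesis.
    for s in sents:
        yield f"<audio for: {s!r}>"

for chunk in tts_chunks(sentences(llm_tokens("When do you close?"))):
    print(chunk)
```

With this structure, the caller hears the first sentence as soon as it is ready; a fast TTS stage means the audible gap is dominated by the LLM's first tokens, not by synthesis.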
The psychology behind AI voice quality shows that callers make trust judgments within the first few seconds. A voice that sounds human keeps people engaged. A voice that sounds like a machine gives them a reason to hang up.
Cartesia Sonic 2 did not invent good TTS. But it raised the bar for what “good enough for phone calls” means. 90ms latency, strong multilingual support, and voice cloning from a 3-second sample: that combination, at $46.70 per million characters, changed the economics and quality threshold for anyone building voice AI products that talk to real people on real phone lines.