Cartesia Sonic 2: The TTS Model Behind Natural AI Phone Voices

Cartesia Sonic 2 delivers text-to-speech at 90ms latency, with support for 15 languages and voice cloning. Here's why it matters for AI phone assistants.

David Schemm

In March 2025, Cartesia shipped Sonic 2 and announced a $64 million Series A. For most people, that was a funding headline. For anyone building voice AI products, it was the moment phone calls started sounding different.

Sonic 2 is a text-to-speech model. It turns text into spoken audio. That description undersells what changed. Before Sonic 2, the best TTS options for phone conversations forced a tradeoff: you could have a natural voice or a fast response, and getting more of one meant giving up a lot of the other.

Cartesia built a model that does both. And it matters more for phone calls than for any other use case.

Cartesia Sonic 2: The TTS Model That Changed How AI Sounds on the Phone

Cartesia is a San Francisco-based company that builds real-time AI models. Their TTS line, called Sonic, has focused on low-latency voice generation since its first version. Sonic 2 launched in March 2025 as a major overhaul, not an incremental update.

The technical architecture is different from what most TTS providers use. ElevenLabs, OpenAI, and Play.HT are generally understood to rely on transformer-based architectures, the same family of models behind GPT and similar language models. Cartesia went a different route with state space models (SSMs). These handle sequential data, like audio, differently from transformers. Instead of attending to every previous token at each step, an SSM maintains a compressed, fixed-size state that gets updated with each new input.

The practical result: faster generation with less compute. Transformers scale poorly for long sequences because attention cost grows quadratically with sequence length. SSMs do a fixed amount of work per step, so their cost grows only linearly, which is why Cartesia can produce speech at the speeds they do.
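To make the scaling argument concrete, here is a minimal sketch in Python. It is not Cartesia's architecture: it is a toy scalar state space recurrence, and the coefficients a, b, and c are arbitrary illustrative values. The two counting helpers are also illustrative, showing how the amount of work grows with sequence length in each approach.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a toy scalar linear state space recurrence over a sequence.

    state_t  = a * state_{t-1} + b * input_t
    output_t = c * state_t
    The coefficients are arbitrary illustrative values.
    """
    state = 0.0
    outputs = []
    for u in inputs:
        # One fixed-size update per input, regardless of how long the history is.
        state = a * state + b * u
        outputs.append(c * state)
    return outputs


def causal_attention_pairs(n):
    """Count the (query, key) pairs causal self-attention scores for n tokens."""
    return n * (n + 1) // 2


def ssm_state_updates(n):
    """Count the state updates a recurrent scan performs for n tokens."""
    return n


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, causal_attention_pairs(n), ssm_state_updates(n))
```

Going from 1,000 to 10,000 steps multiplies the attention pair count by roughly 100, while the number of state updates only grows by 10.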

In blind evaluations, Sonic 2 showed a 1.5x preference rate over the next best provider. That number came from Cartesia’s own testing, so take it with appropriate skepticism. But the quality difference is noticeable when you hear it. The voices sound less like TTS and more like a person reading naturally, with the right pacing and emphasis.

Cartesia charges $46.70 per million characters. For a deeper look at how different providers compare on cost and quality, we put together a full comparison guide.
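For a rough sense of what that rate means per response, the arithmetic can be sketched as below. The price is the figure quoted above; the function name and the example character counts are assumptions chosen for illustration, not values from Cartesia's documentation.

```python
PRICE_PER_MILLION_CHARS_USD = 46.70  # rate quoted in this article


def tts_cost_usd(num_chars, price_per_million=PRICE_PER_MILLION_CHARS_USD):
    """Estimate the TTS cost in US dollars for a given number of characters."""
    return num_chars / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical sizes: a short spoken reply and a longer call's worth of text.
    for chars in (200, 1_000, 10_000):
        print(f"{chars} characters: ${tts_cost_usd(chars):.4f}")
```

At this rate, 1,000 characters of generated speech works out to a little under five cents.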

Why 90 Milliseconds Changes Everything

Here is where the phone call context becomes critical.

When you are reading a blog post, the TTS latency does not matter. An audiobook can take a full second to start playing and nobody cares. But on a phone call, every millisecond of silence after someone stops talking creates awkwardness. Humans are sensitive to conversational pauses. Research on turn-taking in conversation shows that gaps longer than about 200 milliseconds start feeling unnatural.

Traditional TTS systems run between 200 and 500 milliseconds of latency. That is the time from receiving the text to producing the first audio bytes. At 300ms of TTS latency, combined with speech recognition time and LLM processing, the total delay in an AI phone call easily reaches 800ms to a full second. Callers notice. They start repeating themselves, talking over the AI, or hanging up.

Sonic 2 brings model latency down to 90 milliseconds. The Sonic Turbo variant hits 40ms. At sub-100ms TTS latency, the TTS stage nearly disappears from the total delay budget. The caller hears a response coming back at a pace that feels normal.
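The delay budget can be sketched as simple addition across the pipeline stages. In the example below, the TTS figures come from this article (300ms as a representative traditional value, 90ms for Sonic 2, 40ms for Sonic Turbo). The speech recognition and LLM figures are assumptions picked only so the totals land in the range described above; real values vary by provider and setup.

```python
ASSUMED_ASR_MS = 150  # assumption: speech recognition time, not a measured value
ASSUMED_LLM_MS = 400  # assumption: LLM processing time, not a measured value


def total_delay_ms(asr_ms, llm_ms, tts_ms):
    """Sum the per-stage latencies into one end-to-end response delay."""
    return asr_ms + llm_ms + tts_ms


def tts_share(asr_ms, llm_ms, tts_ms):
    """Return the fraction of the total delay contributed by the TTS stage."""
    return tts_ms / total_delay_ms(asr_ms, llm_ms, tts_ms)


if __name__ == "__main__":
    scenarios = [("Traditional TTS", 300), ("Sonic 2", 90), ("Sonic Turbo", 40)]
    for label, tts_ms in scenarios:
        total = total_delay_ms(ASSUMED_ASR_MS, ASSUMED_LLM_MS, tts_ms)
        share = tts_share(ASSUMED_ASR_MS, ASSUMED_LLM_MS, tts_ms)
        print(f"{label}: {total} ms total, TTS is {share:.0%} of the delay")
```

Under these assumed numbers, the total drops from 850ms to 640ms with Sonic 2, and the TTS stage shrinks from about a third of the delay to roughly a seventh.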

This is not about specs on a benchmark. It is about whether a caller feels like they are talking to a responsive system or waiting for a machine to catch up. At 90ms, most people stop noticing the TTS delay entirely. The conversation flows the way phone conversations should.

For a technical look at how this fits into the full voice pipeline, including speech recognition and LLM inference, read our breakdown of Safina’s TTS approach.

Voice Cloning and 15 Languages

Sonic 2 launched with support for 15 languages: English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, and Turkish. That is a wide net for a model that maintains quality across all of them.

Voice cloning works from just 3 seconds of audio. You record a short sample, and the model can generate speech in that voice across any of the supported languages. For businesses, this means a company can maintain a consistent brand voice across international markets without recording separate voice talent for each language.

The multilingual angle matters for AI phone assistants specifically. A business in Berlin might field calls in German, English, and Turkish. A company in Miami handles English and Spanish daily. Being able to respond in the caller’s language, with natural pronunciation and the same voice identity, changes how callers experience automated phone systems.
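A phone assistant needs some rule for choosing the reply language. The sketch below shows one simple approach: reply in the caller's detected language if it is among the 15 listed above, otherwise fall back to a default. The ISO 639-1 codes, the function name, and the fallback behavior are all assumptions for illustration; they are not taken from Cartesia's API.

```python
# The 15 languages listed in this article, keyed by ISO 639-1 code.
# Using these codes as identifiers is an assumption, not Cartesia's API.
SUPPORTED_LANGUAGES = {
    "en": "English", "fr": "French", "de": "German", "es": "Spanish",
    "pt": "Portuguese", "zh": "Chinese", "ja": "Japanese", "hi": "Hindi",
    "it": "Italian", "ko": "Korean", "nl": "Dutch", "pl": "Polish",
    "ru": "Russian", "sv": "Swedish", "tr": "Turkish",
}


def pick_reply_language(detected_code, fallback="en"):
    """Return the language code to answer in, given a detected locale tag.

    Region suffixes such as "de-DE" are reduced to the base language.
    Unsupported languages fall back to the default.
    """
    base = detected_code.strip().lower().split("-")[0]
    return base if base in SUPPORTED_LANGUAGES else fallback


if __name__ == "__main__":
    # Hypothetical detected locales for incoming calls.
    for tag in ("de-DE", "tr", "es-US", "ar"):
        code = pick_reply_language(tag)
        print(f"{tag} -> {code} ({SUPPORTED_LANGUAGES[code]})")
```

For the Berlin example, calls detected as German, English, or Turkish would each get a reply in the caller's own language, while an unsupported language would fall back to the default.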

We have written separately about why multilingual support is a big deal for AI phone assistants and the broader voice AI landscape in 2026.

What This Means for AI Phone Assistants

Phone calls are audio-only. There is no visual UI, no chat bubble, no loading spinner. The voice IS the entire product experience. When that voice sounds flat, robotic, or slow, callers lose trust fast. When it sounds natural and responsive, they engage with the system the way they would with a human.

This is why TTS quality is not a nice-to-have for phone assistants. It is the core of whether the product works.

Low latency creates natural conversation flow. Callers ask questions, and the response starts before the pause gets uncomfortable. Good prosody (the rhythm and intonation of speech) means the AI sounds like it understands what it is saying, not just reading words aloud. Voice cloning means a business can have its AI assistant match the warm, professional tone the brand has built over years.

Products like Safina use TTS as the final stage in a pipeline that includes speech recognition, language model processing, and audio generation. Each stage adds latency. When the TTS model can do its part in under 100 milliseconds, the total response time stays within the range that feels conversational.

The psychology behind AI voice quality shows that callers make trust judgments within the first few seconds. A voice that sounds human keeps people engaged. A voice that sounds like a machine gives them a reason to hang up.

Cartesia Sonic 2 did not invent good TTS. But it raised the bar for what “good enough for phone calls” means. 90ms latency, strong multilingual support, and voice cloning from a 3-second sample: that combination, at $46.70 per million characters, changed the economics and quality threshold for anyone building voice AI products that talk to real people on real phone lines.

