Cartesia just shipped Sonic 3, and the upgrade is hard to ignore. Their previous model, Sonic 2, was already one of the fastest TTS engines available, with latency numbers that made it a favorite for real-time voice applications. Sonic 3 keeps that speed (sub-100ms model latency) while adding two things the voice AI industry has been waiting for: expressive emotion and broad language coverage.
The snapshot (sonic-3-2025-10-27) landed in late 2025. If you’ve worked with Sonic 2, the jump is significant. Language support went from 15 to over 40, covering roughly 95% of the world’s population. And the model can now laugh, express concern, convey warmth, and adjust tone through SSML tags and API parameters. That’s a different product.
Here’s what changed and why it matters, especially for anyone building voice AI that talks to real people on the phone.
Emotion in Voice: Why It Matters on the Phone
Phone calls are not text chats. When someone calls a business, they bring their emotional state with them. A frustrated customer calling about a billing error. An anxious patient calling a medical office. A new lead calling with genuine excitement about a product.
The voice on the other end of the line sets the tone for the entire interaction. Research consistently shows that vocal warmth and appropriate emotional mirroring increase trust and caller satisfaction. A flat, monotone response to “I’m really worried about this” makes callers feel unheard.
Sonic 3 introduces the ability to match emotional context. The model can express:
- Warmth and empathy when a caller is distressed
- Enthusiasm when delivering good news or positive information
- Calm reassurance for anxious callers
- Natural laughter during lighter moments in conversation
This isn’t about faking emotions. It’s about not sounding dead when the conversation calls for something human. The SSML controls let developers adjust volume, speed, and emotional tone at the sentence level, so a single response can shift from informative to reassuring depending on the content.
For AI phone assistants, this closes a gap that has bothered anyone who has listened to their own system handle a call. The words might be right, but the delivery falls flat. Sonic 3 gives developers the tools to fix that.
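To make sentence-level control concrete, here is a minimal sketch of how a reply could be annotated before it is sent to a TTS engine. It is not Cartesia’s API: the tag names (`<emotion value=…/>`, `<speed ratio=…/>`, `<volume ratio=…/>`), the emotion label, the ratio convention, and the `Sentence` and `to_tagged_transcript` helpers are all assumptions made for illustration. Check the exact tag syntax, and whether a tag applies only to the next sentence or persists, against the provider’s documentation.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Sentence:
    text: str
    emotion: Optional[str] = None   # e.g. "calm"; label names are assumptions
    speed: Optional[float] = None   # 1.0 = unchanged (assumed convention)
    volume: Optional[float] = None  # 1.0 = unchanged (assumed convention)


def to_tagged_transcript(sentences):
    """Prefix each sentence with the control tags that apply to it.

    The tag vocabulary is hypothetical; only the per-sentence structure
    is the point of this sketch.
    """
    parts = []
    for s in sentences:
        tags = []
        if s.emotion is not None:
            tags.append(f"<emotion value={quoteattr(s.emotion)}/>")
        if s.speed is not None:
            tags.append(f"<speed ratio={quoteattr(str(s.speed))}/>")
        if s.volume is not None:
            tags.append(f"<volume ratio={quoteattr(str(s.volume))}/>")
        # Escape the spoken text so stray "<" or "&" cannot be read as markup.
        parts.append("".join(tags) + escape(s.text))
    return " ".join(parts)


reply = [
    Sentence("Your refund was approved today."),
    Sentence("I understand this was stressful.", emotion="calm", speed=0.9),
]
print(to_tagged_transcript(reply))
```

Keeping the delivery settings next to each sentence, rather than setting one tone for the whole response, is what lets the informative first sentence and the reassuring second one sound different.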
40+ Languages: From 15 to Global Coverage
Sonic 2 supported 15 languages, enough for major European markets and a few others, but with clear gaps. Sonic 3 pushes that number past 40, with some notable additions.
The biggest expansion is in South Asian languages. Nine Indian languages are now supported, which opens doors for businesses serving India’s 1.4 billion people. Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, and Punjabi are all in the list. For companies with customer bases in India (or Indian diaspora communities worldwide), this is a practical change.
Beyond South Asia, the model adds coverage across East Asia, Southeast Asia, the Middle East, and Africa. Combined with the existing European language support, the 40+ languages cover most of the commercially relevant language markets on the planet.
What does this mean in practice? A single TTS provider can now handle calls in German, Mandarin, Hindi, Arabic, and Portuguese without switching engines. For products that serve multilingual markets (which, in 2026, means most products), the operational simplification is real. No more stitching together different providers for different languages, each with its own voice characteristics and latency profiles.
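With one provider, language handling can shrink to a lookup table. The sketch below assumes a caller locale is already known (for example from the dialed number or a language-detection step) and maps it to a voice. The `VoiceConfig` type, the `pick_voice` helper, the English fallback, and every voice ID are hypothetical placeholders, not real Cartesia identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConfig:
    language: str  # ISO 639-1 code
    voice_id: str  # hypothetical placeholder, not a real voice ID


# One table for one provider: no per-language engine switching.
VOICES = {
    "en": VoiceConfig("en", "voice-en-placeholder"),
    "de": VoiceConfig("de", "voice-de-placeholder"),
    "zh": VoiceConfig("zh", "voice-zh-placeholder"),
    "hi": VoiceConfig("hi", "voice-hi-placeholder"),
    "ar": VoiceConfig("ar", "voice-ar-placeholder"),
    "pt": VoiceConfig("pt", "voice-pt-placeholder"),
}


def pick_voice(locale: str, default: str = "en") -> VoiceConfig:
    """Map a locale such as 'pt-BR' or 'hi_IN' to a configured voice.

    Falls back to the default language when the locale is not in the table.
    """
    language = locale.replace("_", "-").split("-")[0].lower()
    return VOICES.get(language, VOICES[default])


print(pick_voice("pt-BR").voice_id)
print(pick_voice("sw-KE").voice_id)  # not configured, falls back to English
```

Because every entry points at the same engine, adding a market means adding a row rather than integrating another provider with its own latency profile.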
If you’re interested in how multilingual voice support works in production phone systems, we’ve written about the challenges of multilingual AI phone assistants before.
Enterprise-Ready Compliance
Compliance certifications are one of those things nobody talks about until they need them. Then they become deal-breakers.
Sonic 3 ships with SOC 2 Type II, HIPAA, and PCI Level 1 compliance. That combination covers the three areas where phone AI tends to hit regulatory walls:
- SOC 2 Type II proves that Cartesia’s systems handle data securely over time, not just at a single audit point
- HIPAA opens the door for healthcare applications where patient information flows through the TTS pipeline
- PCI Level 1 means payment-related conversations (reading back order totals, confirming credit card transactions) are handled at the highest security standard
For companies in healthcare, finance, or any regulated industry, these certifications mean Sonic 3 can be evaluated without the legal team immediately saying no. That’s not a small thing. Many otherwise excellent TTS providers stall in enterprise sales cycles because they lack one or more of these certifications.
Voice Cloning Gets Faster
Sonic 3 also updates the voice cloning story. Instant cloning now works with just 10 seconds of reference audio. Record a short sample, and the model generates a clone that captures the speaker’s characteristics.
For businesses that want their AI phone assistant to sound like a specific person (a founder, a brand spokesperson, a receptionist whose voice callers already know), this lowers the barrier. Previous approaches required longer samples or professional recording sessions. Ten seconds is something you can capture on a smartphone.
For higher-fidelity needs, Cartesia still offers a professional voice cloning option with more extensive input. But the 10-second instant path makes experimentation easy.
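Before uploading a sample for instant cloning, it is worth checking that the recording actually reaches the 10-second mark. The sketch below does that for WAV files using only Python’s standard `wave` module. The helper names are made up for illustration, the upload step is deliberately left out because it depends on the provider’s API, and the synthetic silent clip exists only to exercise the check.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 10.0  # reference length quoted for instant cloning


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file, given a path string or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(source, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the recording is at least `minimum` seconds long."""
    return wav_duration_seconds(source) >= minimum


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


print(is_long_enough(make_silent_wav(12)))
print(is_long_enough(make_silent_wav(4)))
```

A duration check says nothing about audio quality, so a real pipeline would likely also look at background noise and clipping before accepting a smartphone recording.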
What This Means for AI Phone Assistants
Every improvement in TTS feeds directly into the quality of AI phone calls. And Sonic 3 hits the areas that matter most for phone applications.
Emotion changes how callers respond. When a voice can express appropriate concern or warmth, callers are more comfortable. They stay on the line longer, share more information, and leave conversations with a better impression. For products like Safina that handle real business calls, this translates directly into better outcomes: more captured leads, happier callers, and fewer complaints about “talking to a robot.”
More languages mean more markets. An AI phone assistant that only speaks 15 languages leaves money on the table. At 40+, the constraint shifts from “can we support this language?” to “should we enter this market?” That’s a better problem to have.
Compliance removes friction. Healthcare practices, law firms, and financial advisors can consider AI phone solutions without a months-long compliance review. The certifications are already in place.
We’ve covered how Safina approaches text-to-speech in our technical deep-dive series. Advances like Sonic 3 are exactly the kind of upstream improvement that makes the entire voice AI stack better. And if you want to see where TTS providers stack up against each other more broadly, our TTS comparison guide covers the field.
The psychology of AI voice matters more than most people realize. Sonic 3’s emotion features don’t just sound nice in a demo. They address a real gap between what callers expect and what most AI voices deliver today.
For a broader look at where voice AI is headed, see our voice agent landscape overview for 2026.