Voice AI — Ch. 19

I was on a call with a client’s support team last month. They were drowning. Thirty incoming calls an hour, two people answering phones, hold times averaging twelve minutes.

We set up a voice agent as a proof of concept. Deepgram for speech-to-text, Claude Sonnet for reasoning, Cartesia for text-to-speech, LiveKit for the plumbing. The agent answers the phone, understands what the caller wants, checks their order status through an MCP connection to the database, and either resolves the issue or transfers to a human with a full summary.
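The resolve-or-transfer decision at the end of that flow is worth sketching. What follows is a minimal illustration, not the client's code: `OrderStatus`, `CallOutcome`, the intent labels, and the `lookup_order` stub standing in for the MCP database call are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderStatus:
    order_id: str
    state: str      # e.g. "in_transit", "customs_hold"
    note: str = ""


@dataclass
class CallOutcome:
    action: str                     # "resolved" or "transfer"
    reply: str                      # what the agent says to the caller
    summary: Optional[str] = None   # context handed to the human on transfer


# Hypothetical stand-in for the MCP database lookup.
FAKE_ORDERS = {"A100": OrderStatus("A100", "customs_hold", "held at customs")}


def lookup_order(order_id: str) -> Optional[OrderStatus]:
    return FAKE_ORDERS.get(order_id)


def handle_call(intent: str, order_id: Optional[str],
                transcript: list[str]) -> CallOutcome:
    """Resolve simple order-status questions; transfer everything else."""
    if intent == "order_status" and order_id:
        status = lookup_order(order_id)
        if status is not None:
            state = status.state.replace("_", " ")
            reply = f"Order {status.order_id} status: {state} ({status.note})."
            return CallOutcome("resolved", reply)
    # Anything the agent cannot answer goes to a human, with context attached.
    last_turn = transcript[-1] if transcript else "n/a"
    summary = (f"intent={intent}; order={order_id or 'unknown'}; "
               f"last caller turn: {last_turn}")
    return CallOutcome("transfer", "Let me connect you with a teammate.", summary)
```

The point of the `summary` field is that the human never starts cold: whoever picks up the transfer sees the intent, the order, and the caller's last words.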

Three days to build the prototype. The agent handles about 60% of calls without human intervention. Hold time dropped to under a minute.

The convincing moment wasn’t the speed or cost savings. It was a recording of the agent handling a frustrated caller who’d been tracking a package for three days. The agent checked the database, found the tracking number, explained the delay (customs hold), and offered to send an SMS when it cleared. The caller said, “Thank you, that’s all I needed.” Forty seconds. No escalation. The client’s operations manager asked, “How soon can this be live?”

What Voice AI is

Voice AI is five technologies working together: text-to-speech (making AI sound like a person), speech-to-text (transcribing what you’re saying), voice cloning (reproducing a specific voice from as little as 3-10 seconds of audio), real-time conversational voice (full-duplex dialogue with interruptions and natural pauses), and voice agents (autonomous programs handling phone calls end-to-end). Each matured independently. In 2026, they’ve converged.

The players

ElevenLabs is the voice quality leader. Their Conversational AI platform combines TTS, STT, and an LLM brain with phone connectivity, custom knowledge bases, and tool calling. Voices sound natural enough that callers can’t reliably tell it’s AI. The voice cloning capability (Professional Voice Cloning from 30 minutes of audio, or Instant Voice Cloning from seconds) raises real ethical questions the industry hasn’t fully resolved. Paid plans start at $5/month, with a $22/month tier above that.

Retell AI is the fastest way to deploy a phone-answering voice agent. Pre-built templates for appointment scheduling, FAQ handling, lead qualification. No coding required for basic setups. $0.07-0.15 per minute. If your first question is “I need an AI that answers my business phone,” Retell is the starting point.
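To turn a per-minute price into a monthly figure, multiply through. The sketch below uses the $0.07-0.15 range quoted above and the thirty calls an hour from the opening example. The three-minute average call, eight-hour day, and 22 working days are assumptions for illustration only, so substitute your own numbers.

```python
def monthly_cost(calls_per_hour: float, avg_minutes: float,
                 hours_per_day: float, days_per_month: int,
                 price_per_minute: float) -> float:
    """Billable minutes per month multiplied by the per-minute price."""
    minutes = calls_per_hour * avg_minutes * hours_per_day * days_per_month
    return minutes * price_per_minute


# ASSUMPTIONS: 3-minute calls, 8-hour days, 22 working days per month.
low = monthly_cost(30, 3, 8, 22, 0.07)
high = monthly_cost(30, 3, 8, 22, 0.15)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions the range works out to roughly $1,100 to $2,400 a month. Average call length dominates the result, so measure it before trusting any estimate.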

Deepgram leads in speech-to-text, with under 300ms latency and competitive accuracy at $0.0036-0.0048 per minute. AssemblyAI is the strongest for structured understanding: speaker diarization, sentiment, topic detection, PII redaction. Cartesia builds the fastest text-to-speech engine, with under 50ms generation when streaming; other voice AI companies license it.
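Those latency figures only matter as part of a whole-turn budget. Here is a rough sketch of one, using the speech-to-text and text-to-speech upper bounds quoted above. The 400ms figure for the LLM is an assumption, not a measurement, and network overhead is left as a parameter for you to fill in.

```python
STAGE_BUDGET_MS = {
    "stt": 300,  # upper bound quoted above for speech-to-text
    "llm": 400,  # ASSUMPTION: time to first token; measure your own model
    "tts": 50,   # upper bound quoted above for text-to-speech
}


def total_latency_ms(stages: dict[str, int], overhead_ms: int = 0) -> int:
    """Sum the per-stage budgets plus any network or telephony overhead."""
    return sum(stages.values()) + overhead_ms


def slowest_stage(stages: dict[str, int]) -> str:
    """Name the stage with the largest budget, i.e. the one to optimize first."""
    return max(stages, key=stages.get)


print(total_latency_ms(STAGE_BUDGET_MS), "ms, slowest:", slowest_stage(STAGE_BUDGET_MS))
```

With these placeholder numbers the reasoning step, not the audio stages, is the bottleneck. Whether that holds for your stack depends entirely on the model you pick.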

OpenAI’s Realtime API is the end-to-end alternative: audio in, audio out, no intermediate text conversion. The architecture is simpler and feels more natural for casual voice, but it is less controllable for business applications where you need to see and log the text.

LiveKit is the open-source infrastructure layer. It handles WebRTC transport, room management, and media routing. Many production voice agents in 2026 run on LiveKit underneath, whether or not their builders know it.

How voice agents work

Two architectures: the pipeline approach (STT → LLM → TTS, three separate stages) and the end-to-end approach (speech-to-speech, one model handles everything).

Pipeline wins on control, logging, and flexibility — you can swap any component. You see the text at every stage. You can log, filter, and inject business rules between steps. End-to-end wins on naturalness and latency, but you lose visibility into what’s happening.

For business applications, pipeline is the right choice. You need transcripts for compliance. You need to see what the AI understood before it responded. You need the ability to swap providers when prices change.
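A toy version of the pipeline shape makes those three needs concrete. This is a sketch under stated assumptions, not any vendor's SDK: the `Pipeline` class, the three protocols, the stub providers, and the `redact_card_digits` rule are all hypothetical, and real speech-to-text and text-to-speech stages would stream audio rather than pass whole byte strings.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Pipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech
    rules: list[Callable[[str], str]] = field(default_factory=list)
    transcript: list[tuple[str, str]] = field(default_factory=list)

    def turn(self, audio: bytes) -> bytes:
        heard = self.stt.transcribe(audio)
        self.transcript.append(("caller", heard))  # logged before the LLM runs
        answer = self.llm.reply(heard)
        for rule in self.rules:                    # business rules run before TTS
            answer = rule(answer)
        self.transcript.append(("agent", answer))
        return self.tts.synthesize(answer)


# Stub providers so the sketch runs without any vendor account.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, text: str) -> str:
        return f"You said: {text}. Your card ending 4242 is on file."


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def redact_card_digits(text: str) -> str:
    """Example rule: mask any standalone four-digit number before it is spoken."""
    return re.sub(r"\b\d{4}\b", "****", text)
```

Swapping a provider means passing a different object with the same method, for example `Pipeline(stt=OtherSTT(), ...)`. The transcript list and the rule hooks are untouched by the swap, which is what makes the pipeline approach easy to audit.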

Voice cloning: powerful and dangerous

Voice cloning crossed a line in 2025-2026. ElevenLabs can reproduce a voice from seconds of audio. A teenager cloned a school principal’s voice from a graduation speech on YouTube and posted fabricated audio — it took the school district a week to confirm the audio was fake.

The platforms add consent verification and watermarking, but enforcement is imperfect. The technology exists regardless of safeguards. For legitimate use — branded customer service voices, podcast production, accessibility — it’s transformative. For fraud, extortion, and impersonation, it’s a weapon. If your business involves voice communication, you need a deepfake awareness plan.

Open source

Piper is the leading open-source TTS engine. Runs locally with zero cloud dependency. Quality is below ElevenLabs but improving fast, and the privacy guarantee (nothing leaves your machine) matters for healthcare, legal, and financial applications.

Whisper (OpenAI’s open-source STT model) is the reference point for on-device transcription. WhisperKit optimizes it for Apple Silicon. For teams that can’t send audio to cloud providers, the local stack is viable.

The bottom line

Voice AI crossed a threshold. Quality is good enough that callers can’t reliably tell. Latency is low enough for natural conversation. Cost is low enough for real deployment. Platforms are mature enough that you don’t need ML engineers to ship.

If your product involves phone calls, support, scheduling, or any workflow where someone waits on hold — voice agents are ready. Now.

The hardest part isn’t the technology. It’s designing conversations that feel helpful rather than frustrating. People have decades of learned hatred for automated phone systems. Your agent has about ten seconds to prove it’s different from the IVR menu that’s wasted their time since 1995. Make those ten seconds count.

This is the free web edition of Chapter 19. The full text (with voice agent architecture diagrams, LiveKit pipeline configurations, ElevenLabs integration walkthroughs, and voice cloning safety protocols) will be available in 42: The AI Builder’s Stack, coming Q3 2026 on Amazon in hardcover, paperback, and digital.