Pipeline vs Speech-to-Speech: Why Production Voice Agents Still Choose the Hard Way
Tom Shapland, PM at LiveKit
Building a voice agent seems like it should be simple. Your user speaks. The AI speaks back. In 2024, that could mean a single model handling both: a speech-to-speech system like OpenAI’s GPT-4o Realtime API that converts voice directly to voice, skipping the middle steps.
But here’s what Tom Shapland, PM at LiveKit (the platform powering OpenAI’s voice mode), has learned from watching thousands of production voice agents: the simpler path breaks in production.
The gap between “working” and “production-grade” reveals itself in two specific areas that speech-to-speech systems can’t handle: tool calling and control.
Why Tool Calling Breaks Speech-to-Speech Models
When you build a voice agent for real business use cases, the agent needs to do things. It needs to look up a database. Call an API. Trigger an action. That’s “tool calling,” and it’s where pipeline architectures win decisively.
“Speech-to-speech models aren’t very good at tool calling,” Tom explains. “Text-based LLMs are much better at tool calling.”
The reason matters. When you use a speech-to-speech model, the AI is working entirely in the speech domain. It’s optimized to sound natural and conversational. But tool calling requires symbolic reasoning — the agent needs to understand that it should invoke a specific function, pass the right parameters, and handle the response. Speech models trained to sound human aren’t incentivized to do that with precision.
Text-based LLMs, by contrast, have been fine-tuned on millions of examples where they call functions with exact syntax and parameters. They understand the difference between “let me add this to your calendar” and actually parsing the date, extracting it, and making the API call.
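The difference is concrete. A text LLM emits a structured call that your code validates and executes; nothing has to be inferred from audio. The sketch below is a minimal, hypothetical example of that pattern — the tool name, schema, and dispatcher are invented for illustration, in the JSON-schema style most text LLMs are trained to produce:

```python
from datetime import datetime

# Hypothetical tool schema, in the JSON-schema style text LLMs are trained on.
CALENDAR_TOOL = {
    "name": "add_calendar_event",
    "description": "Add an event to the user's calendar.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 datetime"},
        },
        "required": ["title", "start"],
    },
}

def dispatch_tool_call(call: dict) -> str:
    """Validate and execute a structured tool call emitted by the LLM."""
    if call["name"] != CALENDAR_TOOL["name"]:
        raise ValueError(f"unknown tool: {call['name']}")
    args = call["arguments"]
    # Exactness is the point: a malformed date fails loudly here instead of
    # becoming a plausible-sounding spoken answer.
    start = datetime.fromisoformat(args["start"])
    return f"Added '{args['title']}' on {start:%Y-%m-%d %H:%M}"

# Instead of free-form speech, the LLM returns something like this:
result = dispatch_tool_call({
    "name": "add_calendar_event",
    "arguments": {"title": "Dentist", "start": "2025-03-14T09:30:00"},
})
print(result)  # Added 'Dentist' on 2025-03-14 09:30
```

A speech-to-speech model has no equivalent seam: there is no structured output between "hearing" and "speaking" where your code can validate parameters or reject a bad date.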
For most revenue-generating voice agents in production — lead qualification, customer support, sales assistance — tool calling isn’t optional. It’s the point of the interaction.
The Control Problem: Prompting, Customization, and Hallucination
The second issue is control. When you use a speech-to-speech model as a black box, you lose the ability to guide what the AI says.
With a pipeline architecture, you control every layer. You can prompt the LLM with specific instructions. You can customize how words are transcribed by prompting the speech-to-text model with domain vocabulary. You can add constraints at each step. If your voice agent is making errors, you can debug which layer is the problem.
“If you’re using a speech-to-speech model, you don’t have as much control over how certain words are transcribed,” Tom says. But with newer speech-to-text models, “you can prompt those models where you can tell them ahead of time, this is what the user is talking about, and here are some of the words that they will be using.”
This matters enormously for accuracy. A healthcare voice agent needs to transcribe medical terms correctly. A financial services bot needs to parse account numbers and transaction types without error. A pipeline lets you layer in that precision. A speech-to-speech model gives you a take-it-or-leave-it experience.
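The vocabulary-biasing Tom describes can be as simple as building a context string and handing it to the STT layer. The helper below is a hypothetical sketch (the function and its wording are invented for illustration); Whisper, for one, accepts this kind of context via its `initial_prompt` parameter, and other promptable STT models expose similar fields:

```python
def build_stt_prompt(domain: str, vocabulary: list[str]) -> str:
    """Build a context string that biases a promptable STT model
    toward domain vocabulary it would otherwise mis-transcribe."""
    return (
        f"The caller is discussing {domain}. "
        "Expect terms such as: " + ", ".join(vocabulary) + "."
    )

prompt = build_stt_prompt(
    "a cardiology follow-up appointment",
    ["atrial fibrillation", "apixaban", "echocardiogram"],
)
# Passed to the transcription step, e.g.:
#   model.transcribe(audio, initial_prompt=prompt)  # openai-whisper
```

In a pipeline this string lives in your code, versioned and testable per deployment. With a black-box speech-to-speech model, there is no transcription layer to prompt.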
When Speech-to-Speech Wins (And When It Doesn’t)
Speech-to-speech models do have a place. Consumer apps like Gemini and ChatGPT on mobile use them. These are low-stakes interactions where the user is just chatting. There’s no objective to achieve beyond conversation. No data to look up. No actions to trigger.
But there’s an important nuance: even within consumer apps, the most sophisticated voice agents use the pipeline. Tom’s example: “Our customer ProTola builds Tolens, which are consumer voice agents where people talk to this character. That character comes to know you over time and actually has objectives with this conversation with you. And so those are very sophisticated agents. To build agents that sophisticated, they’re not using speech to speech models. They’re using the pipeline approach.”
The pattern is clear. Speech-to-speech is good enough for chat. Pipelines are necessary for control.
The Real Cost of Simplicity
A naive pipeline looks like: Whisper for transcription, Claude for reasoning, ElevenLabs for speech synthesis, WebSocket for streaming, plus voice activity detection for knowing when to listen.
That works. Sort of. You’ll get something running in a weekend. But you’ll also hit latency problems, packet loss, broken turn-taking, and orchestration headaches that only show up when multiple users start interacting with your agent.
“You can build that approach, and you’ll get something working,” Tom says. “But it will be a little bit brittle and you’ll likely have really high latency.”
The problem isn’t the components. The problem is the coordination between them — handling cases like preemptive generation (starting the LLM response before the user finishes speaking to save latency), managing edge cases in turn detection, handling model failures gracefully.
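Preemptive generation is the clearest example of that coordination work. A minimal asyncio sketch, with the LLM stubbed out by a timed stand-in (all names here are invented for illustration, not a framework API): start the response on the interim transcript, then keep the draft if the final transcript matches, or discard it and regenerate if it doesn't.

```python
import asyncio

async def fake_llm(prompt: str) -> str:
    # Stand-in for a streaming LLM call; the 200 ms sleep models the
    # latency that preemptive generation hides.
    await asyncio.sleep(0.2)
    return f"reply to: {prompt}"

async def handle_turn(interim: str, final_delay: float, final: str) -> str:
    """Preemptive generation: start the LLM on the interim transcript,
    then commit or discard the draft when the final transcript lands."""
    draft = asyncio.create_task(fake_llm(interim))  # start before turn ends
    await asyncio.sleep(final_delay)                # user finishes speaking
    if final == interim:                            # transcript unchanged:
        return await draft                          # the draft is already warm
    draft.cancel()                                  # transcript changed:
    return await fake_llm(final)                    # regenerate from scratch

reply = asyncio.run(handle_turn("book a table", 0.15, "book a table"))
print(reply)  # reply to: book a table
```

The happy path returns almost as soon as the user stops talking, because most of the LLM latency ran in parallel with the end of their speech. The unglamorous work is everything around this: deciding when a transcript is "final," cancelling cleanly, and not speaking a draft the user just invalidated.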
That’s why frameworks exist. Not to replace your thinking, but to handle the 47 edge cases you’ll discover in month two of production.
FAQ
What’s the main reason production voice agents don’t use speech-to-speech models?
Speech-to-speech models can’t reliably call tools or access external data — they’re optimized for natural conversation, not symbolic reasoning. Production voice agents need tool calling for lookup, APIs, and workflow control.
Should I build voice agents with a pipeline architecture?
For business use cases (lead qualification, customer support, sales), yes. Pipelines let you control each layer and add tool calling. For consumer chat applications, speech-to-speech models may be sufficient, but sophisticated consumer agents still use pipelines.
What’s the downside of speech-to-speech simplicity?
You lose control. You can’t customize transcription, prompt the reasoning layer, or debug individual components. When accuracy matters — medical terms, financial data, domain-specific vocabulary — a pipeline lets you layer in guardrails.
Can I start with a pipeline approach if I’m new to voice agents?
Use a framework instead of building from scratch. Frameworks handle the edge cases (preemptive generation, turn detection, session management) that a naive pipeline misses. LiveKit’s agents framework, for example, gives you the toolkit plus the coordination layer.
Does the pipeline approach have latency tradeoffs?
It can, but not necessarily. The trick is preemptive generation — starting the LLM response before the user finishes speaking, in parallel with transcription. This removes latency from the critical path.
Which models should I use in my pipeline?
Whisper for transcription (or newer STT models that support prompting), an LLM like Claude for reasoning, a TTS model like ElevenLabs or Edge TTS for speech synthesis. Use WebRTC for audio transport (not WebSockets) for real-time reliability.
Why does OpenAI use speech-to-speech for ChatGPT on mobile but LiveKit powers their server-side infrastructure?
OpenAI’s mobile experience is consumer-focused (chat). Their server infrastructure needs to handle tool calling, API integration, and scalability for production workflows. Different constraints, different architectures.
Is the pipeline approach getting harder or easier as models improve?
Easier. New speech-to-text models support prompting (context injection), making transcription more accurate. New LLMs are better at tool calling. The real complexity is orchestration — knowing when each component should run and handling the states in between.
Can I use both approaches in one system?
Yes, but it’s rarely worth it. You might use speech-to-speech for initial intake and then switch to a pipeline when the agent needs to call tools or access data. In practice this adds complexity without much benefit.
When will speech-to-speech models be good enough for production?
When they can handle tool calling with the same precision as text-based LLMs, and when they offer the customization and control that regulated industries require. The models are improving, but the gap remains significant.
Full episode coming soon
This conversation with Tom Shapland is on its way. Check out other episodes in the meantime.
Visit the Channel