Founder Insight

Why Turn Detection Is the Biggest Problem in Voice AI (And How to Fix It)

Tom Shapland, PM at LiveKit


You’re talking to a voice agent. The AI responds to your question. But instead of waiting for you to stop speaking, it interrupts mid-sentence. Or it waits too long, creating an awkward silence. Your frustration grows. The conversation breaks.

This is the number one failure mode of voice agents in production.

Tom Shapland, PM at LiveKit, describes it directly: “The biggest problem in voice agents currently, and it has been this since LLM-based voice agents have come to the scene, has been turn detection. Because the conversation breaks down and the user gets frustrated and the call doesn’t go to its objective when the agent and humans start talking over each other.”

Turn detection isn’t sexy. It’s not a new model or a breakthrough algorithm. But it’s the difference between a voice agent that feels natural and one that makes users want to hang up.

What Turn Detection Actually Does

Turn detection solves one fundamental problem: knowing when the human is done speaking so the AI can start its response.

It sounds simple. It’s not.

The challenge is that silence doesn’t mean someone is finished talking. A thoughtful person pauses while forming their response. A non-native speaker takes time between sentences. A child thinks before answering. And different people have different patterns.

“People have different styles of talking to each other,” Tom explains. “Part of what you and I do when we talk is when we first start talking, we’re doing some sort of calibration with each other and know how long to wait.”

Humans handle this through paralinguistic cues — the tone, pacing, breathing patterns, and semantic context that signal someone’s intent to yield the floor. They also dynamically adjust. “Even midway through a conversation, I will dynamically calibrate and be like, this is actually kind of a more thoughtful topic. And so I should maybe give her more space for thinking of her response.”

A voice agent trying to detect turns via simple silence thresholds has none of this. It either cuts people off or waits awkwardly.
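To see why a fixed silence threshold fails, here is a minimal sketch (the pause values and threshold are illustrative assumptions; real systems operate on audio frames, not pre-computed pause lengths):

```python
# Naive turn detection: treat any pause longer than a fixed threshold
# as the end of the user's turn.

SILENCE_THRESHOLD_S = 0.7  # fixed pause length treated as "end of turn"

def detect_turn_ends(pause_durations_s):
    """For each observed pause, report whether the fixed threshold
    would declare the turn over."""
    return [pause >= SILENCE_THRESHOLD_S for pause in pause_durations_s]

# A fast adult speaker: short mid-sentence pauses, a long final pause.
fast_speaker = [0.2, 0.3, 1.0]
# A thoughtful child: a long pause while forming the answer, mid-turn.
thoughtful_child = [0.9, 0.4, 1.2]

print(detect_turn_ends(fast_speaker))      # only the final pause fires
print(detect_turn_ends(thoughtful_child))  # the first pause fires too early
```

The same threshold that works for the fast speaker interrupts the thoughtful one at the very first pause, which is exactly the failure mode described above.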

Why It Fails with Kids (And Why That Matters)

Tom’s daughter uses ChatGPT’s voice mode. “When she talks with ChatGPT using the voice mode, it was actually a terrible experience. The way kids talk and the way they pause and the way they interrupt are more unpredictable than adults. She’s thinking and then ChatGPT just jumped in, and then they started arguing with each other.”

This happens because consumer voice models are trained on typical adult conversation patterns. Kids have different patterns. So do people from different cultures. Non-native speakers. People with processing differences.

A production voice agent can’t assume a single conversation style. It needs to handle variation. That’s why turn detection matters: it’s the knob that lets you dial in the right behavior for your use case.

The LiveKit Approach: Flexibility Over Defaults

This is where many voice agent platforms fail. They bake in a single turn-detection strategy and force everyone to live with it. The result works well for some users and poorly for others.

LiveKit’s approach is different. “We aim to have it default out of the box, the turn taking to be as good as possible,” Tom says. “We recognize that there’s different use cases and that you have the flexibility to parameterize the models and build your agent in a way that is going to work for your use case and the different turn taking characteristics of your use case.”

In practice, this means:

You can customize turn-taking behavior. Adjust how long the agent waits before assuming the user is done. Tighten thresholds for faster, more aggressive turn-taking. Loosen them for thoughtful conversations.

You can layer in multiple signals. Not just silence duration, but semantic signals (did the LLM’s response indicate the user should answer?), prosodic signals (tone and pacing), and explicit model outputs trained to predict turn transitions.

You can evolve it dynamically. Some agents adapt their turn-taking mid-conversation based on user patterns.
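As a rough sketch of what "parameterize the models" can look like in practice, here is a hypothetical configuration object with per-use-case presets and a mid-conversation recalibration step. The field names, preset values, and scaling rule are all illustrative assumptions, not LiveKit's actual API:

```python
from dataclasses import dataclass

@dataclass
class TurnTakingConfig:
    min_silence_s: float    # wait at least this long before ending the turn
    max_silence_s: float    # stop waiting after this long regardless
    semantic_weight: float  # how much the "user should answer" signal counts

# Tight thresholds for snappy support calls; loose ones for reflective talk.
FAST_SUPPORT = TurnTakingConfig(min_silence_s=0.3, max_silence_s=1.0,
                                semantic_weight=0.3)
THOUGHTFUL_TUTOR = TurnTakingConfig(min_silence_s=0.8, max_silence_s=3.0,
                                    semantic_weight=0.6)

def adapt(config: TurnTakingConfig,
          observed_avg_pause_s: float) -> TurnTakingConfig:
    """Dynamic recalibration: if this speaker's average pause is longer
    than our floor, stretch both thresholds proportionally."""
    scale = max(1.0, observed_avg_pause_s / config.min_silence_s)
    return TurnTakingConfig(
        min_silence_s=config.min_silence_s * scale,
        max_silence_s=config.max_silence_s * scale,
        semantic_weight=config.semantic_weight,
    )
```

A speaker who averages 0.6-second pauses under the fast-support preset would get both thresholds doubled, mirroring the mid-conversation calibration Tom describes humans doing naturally.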

When Turn Detection Breaks Down (And What That Costs)

Turn detection failures don’t just feel awkward. They derail the entire interaction.

In a lead qualification call, overlapping speech creates confusion about what was actually qualified. In customer support, it breaks the rhythm of problem-solving. In sales calls, it kills trust. In emergency services (where LiveKit powers 911 systems), it can be life-or-death.

The cost of getting it wrong isn’t just a poor user experience. It’s a failed objective. “The conversation breaks down and the user gets frustrated and the call doesn’t go to its objective,” Tom says.

The Model Stack Behind Turn Detection

Turn detection isn’t a single model. It’s a suite of what Tom calls “paralinguistic models” — models trained on speech features beyond the words themselves.

These include:

Speech activity detection — distinguishing speech from background noise or silence.

Voice activity prediction — predicting whether the user is about to speak or is pausing mid-thought.

Turn-taking models — trained specifically to predict the end of a speaking turn based on acoustic and linguistic features.

Semantic models — understanding the content of what was said to infer whether the user is expecting a response.
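One way to picture how a suite like this fits together is a weighted fusion of each model's output into a single end-of-turn probability. The signal names, weights, and decision threshold below are assumptions for illustration, not a description of LiveKit's internals:

```python
def end_of_turn_probability(signals: dict, weights: dict) -> float:
    """Blend per-model probabilities into one score, clipped to [0, 1]."""
    p = sum(weights[name] * signals[name] for name in weights)
    return min(1.0, max(0.0, p))

# Hypothetical weights over the four model families described above.
WEIGHTS = {
    "speech_activity": 0.2,   # is there speech right now?
    "voice_prediction": 0.3,  # will the user resume speaking?
    "turn_model": 0.3,        # acoustic/linguistic end-of-turn model
    "semantic": 0.2,          # does the content expect a reply?
}

signals = {
    "speech_activity": 1.0,   # silence detected
    "voice_prediction": 0.8,  # unlikely to resume
    "turn_model": 0.9,        # strong end-of-turn cue
    "semantic": 0.7,          # question seems fully asked
}

p = end_of_turn_probability(signals, WEIGHTS)
# 0.2*1.0 + 0.3*0.8 + 0.3*0.9 + 0.2*0.7 = 0.85
if p >= 0.75:
    print("agent responds")
```

The point of the sketch is that no single signal decides the turn; a strong semantic cue can compensate for a short pause, and vice versa.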

Most teams building voice agents start by stitching together off-the-shelf components (Whisper for transcription, Claude for reasoning, ElevenLabs for speech synthesis). That’s fine. But turn detection is one piece where you can’t ignore the framework. As Tom notes: “It just makes sense to start with a framework, ideally an open source framework where you can actually see what the code is doing. It’s not a black box.”

FAQ

What exactly is turn detection in voice agents?

Turn detection is the ability to know when a human has finished speaking and it’s time for the agent to respond. It handles variable pauses, thinking time, and different speaking styles so the conversation feels natural instead of awkward.

Why does ChatGPT interrupt kids more than adults?

ChatGPT’s turn detection is trained on typical adult conversation patterns. Kids pause differently and more unpredictably. The model doesn’t adapt, so it triggers too early and cuts off the child’s thinking.

Can I just use a silence threshold for turn detection?

No. A pure silence threshold cuts off thoughtful speakers and creates awkward delays for fast talkers. You need models trained on paralinguistic features — tone, pacing, speech activity — plus semantic understanding of conversation flow.

What’s a paralinguistic model?

A model trained on speech features beyond just the words: tone, pacing, breathing, pitch changes, and timing patterns. These features signal conversational intent the way words alone can’t.

How do I customize turn detection for my use case?

Use a framework that exposes turn-detection parameters. You can adjust silence thresholds, add semantic signals (does the question require a response?), and even build custom turn-detection logic for specialized domains.

Does turn detection add latency?

Not if it’s built into the framework. The overhead is milliseconds. The latency cost of getting it wrong — failed objectives, repeated clarifications, user frustration — is much larger.

Why is turn detection harder for voice agents than for humans?

Humans use context, relationship history, cultural norms, body language, and years of social calibration. Voice agents have only the audio signal and shallow semantic understanding — a small fraction of the information humans draw on.

Can turn detection be fixed by improving the underlying models?

Partially. Better speech models and LLMs help. But the core issue is architectural: you need models specifically trained to detect turn-taking, and you need frameworks that let you tune them per use case.

What happens when turn detection fails in a business call?

Objectives don’t get met. A lead qualification call becomes unclear. A customer support interaction breaks down. Trust erodes. The agent needs to apologize or repeat itself. The call duration extends and resolution rate drops.

Is turn detection the only thing that breaks voice agents?

No. Hallucinations, tool-calling failures, and audio quality matter too. But turn detection is the most common failure mode because it’s the foundation of natural conversation. If turns break, everything else falls apart.

Full episode coming soon

This conversation with Tom Shapland is on its way. Check out other episodes in the meantime.

