Founder Insight

Why Interpretation Will Outlive Translation — And What Dies First

Olga Beregovaya, VP AI at Smartling

One-to-one text translation is dying. But something else—something older, more human—will outlast it by decades. Olga Beregovaya, with 27 years watching language technology evolve, has a contrarian prediction: interpretation (voice-to-voice communication) will have a longer shelf life than translation, precisely because it’s more human.

This seems backward. After all, we’re all watching real-time translation get better. Voice interpreters seem like they should be the first to fall. But Olga has seen every cycle of language technology, and the pattern is clear: the thing that touches human conversation directly is the last to be automated away.

The Three-Step Pipeline Breaking Down

To understand why, start with how interpretation works. When someone speaks Russian in a conference, an AI system must:

  1. Transcribe speech to text (Automatic Speech Recognition — ASR)
  2. Translate text (Machine Translation)
  3. Convert text back to speech (Text-to-Speech synthesis)

Each step introduces points of failure. Olga explains: “You introduce three distinct points of failure, or if you look at the underlying models, 10 distinct points of failure. You need an acoustic model for ASR, a language model for translation, and a speech model for synthesis.”
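
To make the cascade concrete, here is a minimal sketch of the three-stage pipeline under discussion, with each stage as a separate model and the output of one stage feeding the next. The function names are hypothetical stubs for illustration, not any particular vendor’s API; the point is that an error in any stage propagates downstream.

```python
# Minimal sketch of the classic cascaded interpretation pipeline (Russian -> English).
# Each stage is a separately trained model, so each stage is a separate point of
# failure: an ASR mistake becomes the translator's input, and a translation
# mistake becomes the synthesizer's input. All bodies are hypothetical stubs.

def transcribe(audio_ru: bytes) -> str:
    """ASR: turn Russian speech into Russian text (acoustic + language model)."""
    raise NotImplementedError  # e.g. a Whisper-style recognizer

def translate(text_ru: str) -> str:
    """MT: turn Russian text into English text (transformer-based model or LLM)."""
    raise NotImplementedError

def synthesize(text_en: str) -> bytes:
    """TTS: turn English text back into audio with a neural voice."""
    raise NotImplementedError

def interpret_cascaded(audio_ru: bytes) -> bytes:
    # Errors introduced at any step propagate through everything downstream.
    text_ru = transcribe(audio_ru)
    text_en = translate(text_ru)
    return synthesize(text_en)
```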

The old systems were terrible. Voice recognition couldn’t handle accents. Synthesis sounded robotic. The whole pipeline was a chain where one weak link broke everything.

But that’s changing. “Now you can actually do all of that within a single model,” Olga notes. Modern foundation models do ASR, translation, and synthesis in one shot. “ASR systems are becoming way more accurate. Then you have the underlying transformer-based translation model. And we see that large language models are dramatically outperforming neural machine translation.”
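
By contrast, a single-model approach collapses the cascade into one step: source speech in, target speech out. The sketch below assumes a hypothetical SpeechToSpeechModel wrapper to show the shape of that interface; speech-to-speech models of this kind do exist (Meta’s SeamlessM4T is one example), but the class and method names here are illustrative, not any specific library’s API.

```python
# Sketch of the single-model alternative: one foundation model maps source
# speech directly to target speech, with no hand-offs between separately
# trained systems. "SpeechToSpeechModel" is a hypothetical wrapper, not a
# real package; the interface is the point, not the implementation.

class SpeechToSpeechModel:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # e.g. a multilingual speech foundation model

    def interpret(self, audio: bytes, src_lang: str, tgt_lang: str) -> bytes:
        """Source speech in, target speech out, in a single pass."""
        raise NotImplementedError

# Usage (once real weights back the stub):
# model = SpeechToSpeechModel("some-multilingual-speech-checkpoint")
# english_audio = model.interpret(russian_audio, src_lang="ru", tgt_lang="en")
```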

The result? Live interpretation—real-time, voice-to-voice, in a conference hall—is now faster than human interpretation for many language pairs.

Text Translation Dies First. Conversation Lives Longer.

So why does text translation disappear while voice lingers?

Here’s the difference: text is content. It lives on a website or in a document. It’s static. You can generate it once, review it, publish it. If a mistake appears, you can update it. The cost of fallibility is manageable.

Voice is conversation. It’s live, ephemeral, synchronous. When someone speaks to you in real-time, they need real-time understanding. They need to hear your response now. But more importantly, they need to feel understood—not just linguistically, but emotionally, contextually, culturally.

“People still speak to each other, right? And people still want to hear and speak their own language,” Olga explains. There’s an intimacy to speech that text doesn’t have. When your mother speaks to you in Russian, and you hear her in Russian (even if the voice was synthesized), something emotional happens that’s harder to replicate with text.

Text translation is infrastructure. It powers commerce, documentation, support. It’s economically replaceable. Voice interpretation is human connection. The bar for replacement is higher.

Where Interpretation Is Winning Right Now

The breakthrough isn’t personal phone calls. It’s event interpretation. Conferences, courtrooms, medical consultations—these are the early victories.

“You needed to hire like 200, 500 interpreters before,” Olga notes. A three-day conference in 20 languages meant paying for dozens of interpreters. Now you bring a platform—call it a “simultaneous interpretation engine”—and it handles the whole conference automatically. Accuracy has gotten good enough that the platform is better than hiring unvetted interpreters.

The win is biggest in healthcare and legal settings. “The support of patients that don’t speak English, for instance, is light years above where it was before,” Olga says. A Spanish-speaking patient no longer has to wait for a human interpreter to relay their symptoms to an English-speaking doctor. The AI does it live, and accurately enough to support a diagnosis.

But here’s where it gets interesting: courtroom interpretation and medical consultation require higher accuracy than casual conversation. A mistranslation in a doctor’s office could kill. Yet AI is already winning in these domains. Why? Because the speaker (the patient) knows there’s no human interpreter. They adjust their speech. They speak clearly. They check for understanding. Paradoxically, that cooperative adjustment on the human side is what makes the high-stakes system work.

Conversation Stays Human Longest

Now contrast that with a casual conversation between friends in different languages. The bar is actually higher. People don’t adjust their speech. They use idioms, slang, cultural references, inside jokes. They interrupt each other. They mean something different from what they say.

“My parents spoke to me in Russian. Your parents spoke to you in Mandarin. So we still want to hear our mother tongue,” Olga says. And it’s not just a language preference—it’s about identity. When your parent speaks to you in their native language, they’re speaking as themselves. Translating that breaks something.

This is where the prediction gets bold: automation will handle institutional interpretation (courts, hospitals, conferences) faster than it handles intimate interpretation (family calls, friendships, close relationships). The high-stakes formal settings where humans have already accepted lower interpersonal friction—those flip to AI first. The low-stakes high-emotion settings stay human longer, even if the technology works.

“Event globalization is a huge leap in automated interpretation,” Olga says. She’s referring to the trend of using AI to make global conferences and events accessible to non-English speakers simultaneously. That’s a market with clear ROI. Global companies save money, employees from all countries hear the information live. It works.

But will you use AI to interpret a conversation with your grandmother in her native language? Probably eventually. But you’ll resist longer.

The Actual Timeline

Here’s how Olga sees the transition:

5-10 years (near-term): Institutional interpretation becomes AI-native. Hospitals, courts, conferences all shift to AI interpretation. Some human interpreters remain for high-stakes legal cases, but the default is AI.

10-20 years (medium-term): Consumer real-time translation becomes fluent. Travel, business meetings, casual calls between friends in different languages shift to AI. Family calls might still use human interpreters if you can afford it, or you might use AI and accept lower fidelity. The Doraemon translator becomes real.

20+ years (long-term): People still prefer to hear their mother tongue in intimate contexts, but AI has gotten so good that the technological barrier is gone. The remaining human interpreters are specialists: legal experts who testify in court, cultural consultants who work with refugees, educators who teach languages.

But unlike text translation—which becomes fully automated—some niche for human interpreters probably survives in the voice space. Why? Because the bar is emotional, not just technical. Parents still hire tutors even though apps could teach their kids. Some things don’t automate just because they can.

The Missed Follow-Up

There’s a thread Olga touched on that never got fully explored: what happens to human connection when we optimize for convenience? She mentioned it obliquely: if AI removes the friction of language, do we lose the motivation to learn each other’s languages? Do we become more isolated in our monolingual bubbles, even though we can talk to anyone?

“Are we now more ready for an engagement where we’ll always be pleased and never contradicted?” she asked. If your AI interpreter always confirms your understanding, always agrees, always makes you feel understood—but the human on the other end is getting the same comfortable experience in their language—have you deepened the relationship or just both retreated into personalized bubbles?

That’s the question that doesn’t get answered in the technical discussion. And it might be why interpretation stays human longer than translation. It’s not about capability. It’s about whether we want to automate the place where we’re forced to learn another person’s language, bend to their way of speaking, actually meet them halfway.

FAQ

What’s the difference between translation and interpretation?

Translation is text-to-text (books, documents, websites). Interpretation is speech-to-speech, usually real-time (conferences, conversations, phone calls). The workflows are different: translation can be edited and reviewed; interpretation happens live with no do-overs.

If voice-to-voice translation is now one AI model, why can’t it replace human interpreters instantly?

Because human interpreters do more than translate words. They manage conversational flow, catch cultural misunderstandings, adjust pace, ask clarifying questions, and build rapport. The AI can handle the language conversion but struggles with the interaction management part, especially in low-stakes personal conversations where people aren’t adjusting their speech for clarity.

Where is AI interpretation already replacing humans?

Conferences, courtroom testimony, medical consultations, and event globalization. In these contexts, speakers expect less interpersonal finesse and more accuracy. The system works because human expectations are calibrated to institutional settings, not intimate ones.

Why would people keep hiring human interpreters if AI works?

For legal cases where testimony could be challenged, for nuanced business negotiations where a misunderstanding has huge consequences, and for families who want to preserve the intimacy of speaking in their mother tongue with someone they love. The technology might work, but the emotional need for a human interpreter survives.

What about accents? Does AI still struggle with them?

Less than before, but yes. The acoustic model in modern systems is much better, but regional accents and non-native speakers still trip up ASR. This is why interpretation works better in institutional settings (clear speech) than in casual settings (slang, mumbling, multiple voices).

Will event interpretation become the default for all conferences?

Probably within 10 years for most companies. The ROI is clear: hiring 200 interpreters costs $500K. Running an AI interpretation platform costs maybe $20K. Companies will choose AI unless there’s a specific reputation risk (a UN summit wants human interpreters for symbolic reasons).

What happens to simultaneous interpretation (the hardest type)?

That’s where AI made the biggest leap. Simultaneous interpretation—where the interpreter speaks while the source speaker is still talking—is brutally hard because you have to predict what the speaker will say next. AI models that handle this in one pass are dramatically better than human interpreters at this specific task. Humans will be replaced here first.
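
As a rough illustration of what that “one pass” streaming looks like, here is a hedged sketch of an incremental loop: audio arrives in short chunks, each chunk extends a running transcript, and only the portion the model no longer expects to revise gets translated and voiced. The chunk size and all helper functions are assumptions for illustration, not a real streaming API.

```python
# Illustrative sketch of simultaneous (streaming) interpretation: process the
# audio in small chunks and voice only the part of the translation that is
# unlikely to be revised, instead of waiting for the speaker to finish.
# Every helper below is a hypothetical stub.

from typing import Iterator

CHUNK_SECONDS = 2  # assumed latency budget per update

def audio_chunks(stream) -> Iterator[bytes]:
    """Yield roughly CHUNK_SECONDS of audio at a time from a live stream."""
    raise NotImplementedError

def extend_transcript(transcript: str, chunk: bytes) -> str:
    """Update the running source-language transcript with the new chunk."""
    raise NotImplementedError

def stable_prefix(transcript: str) -> str:
    """Return the part of the transcript the recognizer will not revise."""
    raise NotImplementedError

def translate(text: str) -> str:
    raise NotImplementedError

def speak(text: str) -> None:
    raise NotImplementedError

def interpret_simultaneously(stream) -> None:
    transcript, spoken = "", ""
    for chunk in audio_chunks(stream):
        transcript = extend_transcript(transcript, chunk)
        ready = translate(stable_prefix(transcript))
        # Only voice the new tail; if the translation of the prefix shifted,
        # hold until it stabilizes again.
        if ready.startswith(spoken):
            speak(ready[len(spoken):])
            spoken = ready
```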

Are there languages where AI interpretation is much worse?

Yes. Low-resource languages, heavily accented speech, and contexts with lots of jargon or cultural reference. If you’re interpreting Nepali tech entrepreneurship jargon at a startup conference, the AI might struggle more than a human interpreter who understands both the language and the culture.

What if AI interprets a family conversation and gets the meaning wrong? Can you sue?

Not clearly. This is an emerging legal question. If someone relies on AI interpretation for a medical decision and it’s wrong, is the AI provider liable? Is the user liable? This uncertainty might accelerate demand for human interpreters in high-stakes settings, even as AI wins in low-stakes settings.

Will we eventually have a universal translator, like in Star Trek?

In institutional settings (conferences, hospitals), yes—that tech exists now. For intimate conversations where every nuance matters and you want to be understood emotionally, not just linguistically? That’s harder. You might get accuracy but lose the connection that comes from the effort to understand.

What’s your prediction for home use (family calls across languages)?

Within 15 years, video calls with real-time interpretation will be standard in consumer apps. Most families will use it. But some will still prefer hiring human interpreters for weekly calls with grandparents—not because the tech doesn’t work, but because it feels different to know a human is holding that conversation open for you.

Full episode coming soon

This conversation with Olga Beregovaya is on its way. Check out other episodes in the meantime.
