Founder Insight

Why AI Hallucinates Most in the Languages You Ignore — And Why It Matters

Olga Beregovaya, VP AI at Smartling


A language model confidently tells someone that a mushroom is safe to eat. The person takes the model at its word. Fast forward, and there’s a tombstone. The model’s response? “Sorry, this mushroom was poisonous. What would you like to learn about poisonous mushrooms?” This is Olga Beregovaya’s favorite meme—and her favorite metaphor for how AI works. Models are trained to sound confident regardless of whether they know the answer.

For most of the world, this confidence is dangerous. Olga, VP of AI at Smartling, has spent 27 years in translation technology watching models get better at sounding fluent. But fluency isn’t the same as accuracy. And in the languages spoken by billions of people, that gap is widening.

The Language Training Data Is Wildly Unbalanced

Here’s the hard truth about large language models: they were trained primarily on English. When they encounter English, they have billions of tokens to draw from—decades of published books, articles, code, conversations. When they encounter Swahili? Maybe millions of tokens, if that.

“The representation of foreign languages in the model training corpus is skewed,” Olga explains. “The lower you go into how many words that language has in its lexicon within the training data, the higher the odds are that the models will hallucinate.”

This creates a cascade of failures. First, you get lexical coverage problems—the model has never seen words for certain concepts, so it makes them up or confuses similar words. A rare technical term in Tagalog might get mapped to the wrong thing entirely. Second, you get cultural knowledge gaps—the model has learned English-centric assumptions about the world, so when generating in Amharic, it defaults to English patterns that don’t exist in that language.

The worst part? The model sounds just as confident in Amharic as in English. It doesn’t know that it’s making things up. “Models hallucinate much more in foreign languages,” Olga says flatly. “And the lower you go into how many speakers that language has, the higher the odds are that the models will hallucinate.”

Fluent Nonsense Is Dangerous

Hallucination in AI doesn’t always mean the output is gibberish. Often it’s the opposite: the output is fluent. The grammar is perfect. The sentences flow. But the facts are completely wrong.

An AI system generates a Bangla marketing page for a financial product. Every sentence is grammatically correct. The tone matches your brand. But it’s describing a product feature that doesn’t exist. A customer in Dhaka reads it, opens an account, and discovers they’ve been misled. The model generated fluent, confident lies.

“Models do hallucinate, right? Because they are tailored to be giving an answer, and giving a very confident answer,” Olga notes. There’s no uncertainty flag. No “I’m not sure about this.” Just smooth, confident output that happens to be wrong.

This is especially dangerous in low-resource languages because there’s often less oversight. When you’re translating to Spanish, you probably have native Spanish speakers on your team who can spot errors. When you’re generating in Hausa or Quechua, your company might not have a single speaker on staff. The hallucination goes live undetected.

The Semantic Similarity Check Catches Some, Not All

So how do you catch hallucinations before they harm users? The first defense is semantic similarity checking—comparing whether the generated output (in the target language) actually means the same thing as the source (in English or the original language).

“If you see that the target is semantically dissimilar to the source, then obviously something went astray,” Olga explains. You can use a language model to judge similarity, or you can use older NLP techniques like LASER embeddings or semantic vector comparisons.
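The core of that check can be sketched in a few lines. This minimal Python version assumes you already have multilingual sentence embeddings for the source and target (from an encoder such as LASER); the 0.75 threshold is an illustrative placeholder, not a production value.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def flag_if_dissimilar(source_vec, target_vec, threshold=0.75):
    """Flag the generated output when source and target embeddings diverge."""
    score = cosine_similarity(source_vec, target_vec)
    return {"score": score, "flagged": score < threshold}
```

In practice the threshold has to be tuned per language pair, since even faithful translations rarely score a perfect 1.0.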

But there’s a catch: this only works if you have a source. If you’re generating content directly in a long-tail language from a brief (not translating from English), there’s no source to compare to. The model can hallucinate freely, and you won’t catch it with semantic similarity.

Other defenses exist but are labor-intensive. You can check for factual accuracy by running external searches—does this claim about the weather actually match real data? You can analyze statistical patterns—if the source has 5 words and the target has 50, something’s suspicious. You can use named entity recognition to catch if the model invented companies or people that weren’t in the source. But all of these are extra layers of cost and latency.
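Two of those heuristics, the length-ratio check and the invented-entity check, are simple enough to sketch. This assumes entity extraction has already been done upstream by an NER model, and the 3x ratio cutoff is a made-up example value.

```python
def length_ratio_suspect(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Flag outputs whose word count diverges wildly from the source's."""
    src_len = max(len(source.split()), 1)
    tgt_len = max(len(target.split()), 1)
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio

def invented_entities(source_entities: set, target_entities: set) -> set:
    """Names that appear in the output but never in the source:
    candidate fabricated companies, people, or places."""
    return target_entities - source_entities
```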

“We have a lot of hallucination mitigation techniques and you cannot catch all of them,” Olga says. The reality is grim: in languages with little training data and few speakers to review outputs, hallucinations are a cost of doing business.

Where Hallucinations Hit Hardest

The danger zones are predictable:

Long-tail languages: Any language outside the top 10-20. If your model was trained on minimal Somali data, its Somali outputs will be unreliable. But Somali speakers need accurate content too—whether it’s medical instructions, legal notices, or product documentation.

Low-resource technical domains: Generating documentation for software or medical devices in Vietnamese, Polish, or Greek is riskier than generating marketing copy. Technical writing has less tolerance for hallucinated details.

Underrepresented cultural phenomena: The model might have learned “market” in English, Spanish, and Mandarin but never encountered the Uzbek concept of the bazaar or the specific cultural connotations of what trading means in that context. So it generates something that sounds right but is culturally wrong.

Languages with little published training data: A language spoken by 10 million people but with little digital presence (few books, websites, social media) will experience more hallucinations than a language with 10 million speakers and rich digital presence.

The Fix: Layered Defense, Not One Solution

There’s no magic bullet. Instead, companies like Smartling layer multiple defenses:

First, model selection. Some models are better at specific languages than others. Llama models perform differently in low-resource languages than GPT models. You don’t use the same model for all 100 languages.

Second, data curation and fine-tuning. If you’re serious about quality in a specific language, you gather high-quality examples in that language and fine-tune the model. This is expensive—it’s why most companies only do this for their top 10 languages.

Third, retrieval-augmented generation (RAG). Instead of letting the model generate from pure parameters, you give it access to a knowledge base of verified facts. If the model is generating Tagalog content about your company’s pricing, you feed it the actual current pricing data so it can’t make it up.
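A toy version of that grounding step might look like the sketch below. The keyword-overlap retrieval stands in for real vector search, and the fact strings are invented examples; the point is only that retrieved facts are injected into the prompt so the model cannot draw details like pricing from its parameters.

```python
def retrieve_facts(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; production systems use vector search."""
    query_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda fact: len(query_words & set(fact.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, knowledge_base: list[str]) -> str:
    """Pin the model to retrieved facts instead of its parametric memory."""
    context = "\n".join(f"- {fact}" for fact in retrieve_facts(query, knowledge_base))
    return (
        "Answer using ONLY these verified facts:\n"
        f"{context}\n\nQuestion: {query}"
    )
```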

Fourth, human review. For critical content, someone who speaks the language fluently reviews every generated piece. This is slow and expensive, but for regulated industries (healthcare, finance, legal) it’s non-negotiable.

“Platforms like Smartling would definitely have hallucination mitigation techniques,” Olga says. “Send it back to the model, think again. Or send it to a fallback model.” You layer defenses until the risk is acceptable for your use case.
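That retry-then-fallback loop can be expressed as a small control structure. Here `passes_checks` stands in for whatever mitigation stack you run (semantic similarity, entity checks, and so on), models are plain callables, and the retry count is an arbitrary placeholder.

```python
def generate_with_fallback(prompt, models, passes_checks, max_retries=2):
    """Try each model in order; regenerate on a failed hallucination
    check, then fall back to the next model in the chain."""
    for model in models:
        for _ in range(max_retries):
            output = model(prompt)
            if passes_checks(output):
                return output
    return None  # nothing passed: escalate to human review
```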

The Hardest Truth: Low-Resource Languages Get Fewer Defenses

Here’s where equity becomes a business problem: the languages with the most hallucination risk are the languages where companies invest least in mitigation. Why? Economics.

If you serve English and Mandarin speakers, you have huge markets justifying investment in quality. You fine-tune models, hire reviewers, build RAG systems. But if you’re serving Amharic or Lao speakers—millions of people, but smaller total market value—you often deploy with lighter defenses. The model is rolled out faster, with less review, with thinner mitigation.

“The less data you have, the higher are the odds of hallucination,” Olga observes. And the less data a language has globally, the less likely your company is to invest in solving that hallucination problem specifically for that language. It becomes a vicious cycle: under-resourced languages get under-resourced quality infrastructure, so hallucinations persist longer in those communities.

For enterprises going global, this is a silent risk. If you’re expanding to emerging markets where English is less common, your content quality is likely worse—not because of your intentions, but because the economic math of language model training works against low-resource communities.

FAQ

What’s the difference between hallucination in English and hallucination in French?

Hallucination is the same phenomenon in both, but detection is easier in well-resourced languages. With English or French, you have many reviewers and tools to catch errors. In a language with minimal training data, errors are both more common and less likely to be caught before going live.

Can you just use a better model to fix hallucination in low-resource languages?

Partially. Some models do better than others on long-tail languages (Llama performs better than GPT on some low-resource languages). But even the best model will hallucinate more in a language with minimal training data. The problem is the data, not just the model.

Why do companies train models primarily on English if it harms other languages?

Training data follows money and digital footprint. English has centuries of published books, modern digital ecosystems, and global internet dominance. Low-resource languages have less published text available. Training models requires trillions of tokens; if a language has only millions publicly available, it will be under-represented.

How do you detect hallucinations if there’s no source to compare to?

You use fallback techniques: semantic similarity to any available source, external fact-checking (does this claim about history check out?), statistical pattern analysis (does this output have suspiciously different word counts?), and human review by native speakers. But if hallucinations are fluent and plausible, they’re hard to catch without deep expertise.

What does fine-tuning on a language actually fix?

Fine-tuning on high-quality examples in a specific language improves lexical coverage (the model learns real words used in that language), cultural appropriateness, and domain-specific terminology. But it doesn’t fix hallucination entirely—it reduces the rate of hallucination, especially for common patterns.

Is RAG (Retrieval-Augmented Generation) the solution to hallucination?

RAG helps when you have verified facts to feed the model. If you’re generating product documentation, you give it your actual product specs. But RAG only works for factual domains. For creative or contextual content, the model still hallucinates.

Should companies require human review of all generated content in low-resource languages?

For high-stakes content (legal, medical, financial), yes. For marketing and user-generated content, the ROI depends on your market size and risk tolerance. Small markets can’t always justify native-speaker review for every piece, which is why hallucination persists there longer.

What’s “modelese” and how does it relate to hallucination?

Modelese is the distinctive style of output that reveals the text was generated by AI—often polished, overly formal, or slightly off in word choice. Hallucinations often have modelese markers: fluency that’s too perfect for a low-resource language, or confident claims that don’t match cultural knowledge. Spotting modelese can be an early warning sign of hallucination.

How many languages does a company need to have before hallucination is a serious problem?

It’s not about number of languages; it’s about which languages. Hallucination risk is high immediately if you support any language that wasn’t heavily represented in the model’s training data. For most enterprises, that means most languages outside the top 20.

What should a global company do about hallucination in low-resource languages?

Map your language portfolio by data-availability and speaker numbers. For high-stakes languages, invest in fine-tuning and human review. For others, implement semantic similarity checks and empower customers to flag errors. Acknowledge that zero hallucinations in all languages is expensive; accept and mitigate the risk strategically.
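That triage can be made concrete with a simple tiering rule. The token thresholds below are illustrative placeholders, not published figures; the real inputs would come from your own audit of training-data availability and reviewer coverage.

```python
def language_risk_tier(training_tokens: int, has_native_reviewers: bool) -> str:
    """Bucket a language by hallucination exposure (illustrative thresholds)."""
    if training_tokens < 10_000_000 and not has_native_reviewers:
        return "high-risk: require human review or defer launch"
    if training_tokens < 1_000_000_000:
        return "medium-risk: semantic checks plus spot review"
    return "lower-risk: automated checks with customer flagging"
```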

Full episode coming soon

This conversation with Olga Beregovaya is on its way. Check out other episodes in the meantime.
