What AI Hallucination Actually Means — Lessons From 15 Years of Grading Essays
Jan Liphardt, CEO at OpenMind
Hallucination is the buzzword of 2024-2026. Every AI researcher talks about it. Every startup worries about it. Every policy paper demands solutions to it. But most discussions get hallucination wrong because they treat it as a uniquely AI problem.
Jan Liphardt, who has graded undergraduate essays for 15 years and now builds operating systems for robots that use large language models, has a more sobering perspective: hallucination isn’t something AI systems invented. It’s something humans do constantly. The question isn’t “how do we eliminate hallucination?” It’s “how do we build systems that can catch and correct the inevitable errors that come from any imperfect decision-maker?”
“I have a horrible secret,” Liphardt said in his interview with TwoSetAI. “I’ve been grading undergraduate homework for 15 years. And let me tell you that if you’ve ever submitted a problem set at three in the morning and you wake up and you don’t remember what you wrote, let me tell you that the teachers and faculty grading your problem set have lost all illusion about hallucinations.”
That reframing is more useful than most of the technical discussion around hallucination.
Hallucination as a Human Universal
When you read an essay that a student submitted at 3 AM, you’re not reading the result of a perfectly calibrated mind. You’re reading fabrications, inconsistencies, leaps in logic, and sometimes creative interpretations of facts that are flatly wrong. The student genuinely believed what they were writing in the moment. They were hallucinating — not lying, hallucinating.
Humans do this in real time. You walk into a meeting and confidently state a fact you’re pretty sure is true. You’re wrong. You order something online and tell your friend a feature it doesn’t have. You remember a movie scene that never existed. You confabulate details to fill gaps in your memory. These aren’t failures of your intelligence. They’re how human cognition works.
The difference is that humans are biologically forced to accept some imprecision. Your brain uses heuristics and shortcuts because evolution optimized for “good enough and fast,” not “perfect and slow.” You can’t possibly verify every fact before you speak. You’d never speak.
An LLM is doing the same thing — running a sophisticated statistical prediction engine that produces the next most likely token. When that prediction turns out to be false, we call it a hallucination. That’s not a bug unique to neural networks. It’s a fundamental feature of how decision-making under uncertainty works.
Why We Keep Expecting Perfect AI
The mismatch between expectation and reality comes from how we talk about AI. We call it “artificial intelligence” as if intelligence is a unitary thing you either have or don’t. We compare AI systems to humans as if the comparison is symmetric: “Is the AI as good as a human doctor?” But the real question is always: “Is the AI better than the human at catching errors, and what systems do we need to build around it to make it reliable?”
Liphardt’s insight is that we already have systems for dealing with human error and unreliability. We have juries, courts, judges, second opinions, peer review, audits, and oversight. When a doctor makes a diagnosis, we have second opinions. When a lawyer makes an argument, we have opposing counsel. When a journalist writes a story, we have editorial review.
These institutional safeguards exist because we accepted long ago that individual decision-makers — even very smart ones — are unreliable. The solution was never to demand perfection. It was to build systems that catch and correct errors.
“My recommendation would be first to look at the decisions all of us make daily before we get all worried or before we somehow say that robots are particular in their ability to make bad decisions,” Liphardt said.
The robots using LLMs are just the first AI systems where we’re forced to confront this explicitly. We’re not used to thinking of our software as an unreliable agent that needs oversight. We’re used to thinking of software as deterministic. But the moment you add machine learning, you’re adding the same fallibility that humans have.
The Practical Fix: Committee Decision-Making
So how do you build AI systems that work reliably despite the fact that their individual decisions will sometimes be wrong?
Liphardt’s answer is deceptively simple: use multiple systems that check each other. OpenMind uses what they call a “mother LLM” or mentor — a second language model that watches the main LLM and critiques its behavior every 30 seconds. “Stand up straight. Look at people ahead of you. Preface everything you say with [specific phrases],” the mother LLM instructs.
This isn’t perfect. But it’s not designed to be perfect. It’s designed to reduce the error rate and catch obvious hallucinations before they propagate.
The principle here is borrowed from institutional design: never have a single decision-maker. Always have a check. In courts, it’s the jury checking the judge. In science, it’s peer review checking the researcher. In medicine, it’s a second opinion. The same principle applies to AI systems.
A single language model hallucinating about what it saw is bad. A single language model hallucinating while a second model is actively monitoring and correcting is much better. Still not perfect, but better.
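The actor-plus-critic pattern described above can be sketched in a few lines. This is a toy illustration, not OpenMind’s actual implementation: `call_primary` and `call_critic` are hypothetical stand-ins for real LLM calls, and the critic’s “rule” here is a placeholder for a second model’s judgment.

```python
def call_primary(prompt: str) -> str:
    # Stand-in for the main model proposing an answer or action.
    return f"PROPOSED: {prompt}"

def call_critic(proposal: str) -> tuple[bool, str]:
    # Stand-in for a second model reviewing the proposal.
    # A real critic would check facts, tone, and safety constraints;
    # this toy rule flags anything mentioning medication for review.
    ok = "medication" not in proposal
    return ok, "" if ok else "Flagged: verify medication names with a human."

def answer_with_oversight(prompt: str) -> str:
    proposal = call_primary(prompt)
    ok, critique = call_critic(proposal)
    if ok:
        return proposal
    # The critique blocks the output instead of letting the error propagate.
    return f"WITHHELD ({critique})"
```

The point of the structure is that the critic only needs to catch obvious problems to meaningfully lower the error rate — it doesn’t need to be perfect any more than the actor does.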
When Humans Are the Real Safeguard
The other layer OpenMind has built is human oversight. They maintain what Liphardt calls a “teleops and observability portal” — a system where human observers (nurses, teachers, retired police) can watch what the robot is doing in real time and step in when something goes wrong.
This sounds like a regression — why would you need humans if you have AI? But it’s actually sophisticated error handling. You’re not relying on the AI to never make mistakes. You’re assuming mistakes will happen and building a system where a human can quickly correct them.
This is how Waymo works too. Waymo’s self-driving cars have incidents, but they also have round-the-clock remote operators who can assist a vehicle that gets stuck or confused. The system isn’t “perfect autonomous AI.” It’s “mostly autonomous with human backup.” That’s a much more honest and effective design than promising perfect autonomy.
Liphardt’s observation is that this will be the pattern for most robotics and AI deployments for at least the next five years: “For a lot of these home deployments or schools or hospitals, we’re looking at easily a period of the next five years where many of these systems will have an awesome, smart human very close to the robots, figuratively speaking.”
That’s not a limitation. That’s accepting reality about how decision-making at scale actually works.
The Real Problem Hallucination Reveals
The real issue hallucination surfaces isn’t that AI is unreliable. (Of course it is — it’s approximating infinite complexity with finite parameters.) The real issue is that we’ve spent decades treating software as deterministic and perfect. Now we’re moving to software that’s probabilistic and approximate. That requires a different mental model.
When you buy software, you expect it to behave the same way every time. When you ask an AI system the same question twice, you should expect different answers — sometimes better, sometimes worse. That’s not a bug. That’s the nature of the technology.
The systems that will be trusted with real autonomy (robots in hospitals, autonomous vehicles, teaching assistants) will be the ones that build in the assumption of error from day one. They’ll have oversight, checkpoints, human-in-the-loop mechanisms, and fallback plans. They won’t try to achieve perfection. They’ll achieve reliability through redundancy and correction.
FAQ
Is hallucination more common in LLMs than in human decision-making?
By raw percentage? Probably not. Humans are constantly generating false memories and non-factual statements. The difference is humans are embedded in a social context where those errors are usually caught quickly. An LLM generating false information without social friction is more dangerous, not because hallucination is unique to AI, but because there’s less built-in error correction.
If hallucination is normal, why do people treat it as a crisis?
Because we’re used to software being deterministic. If Slack returns the wrong message, it’s a catastrophic bug. But people are comfortable with doctors sometimes being wrong — we’ve built institutional safeguards around that. Treating LLM hallucination as a crisis is partly justified (it’s new) and partly confused (it’s not new, just visible for the first time).
Does this mean AI systems will always need human supervision?
For the next 5-10 years in high-stakes domains like healthcare, education, and home robotics, yes. Eventually, some domains might achieve reliable enough autonomous operation. But that doesn’t mean “no human oversight ever.” It means the ratio of humans to machines becomes more efficient. One human might monitor 100 robots instead of 10. That’s still human-in-the-loop.
What’s the difference between a hallucination and a plain mistake?
A hallucination is when the system confidently states something false because it was the statistically most likely token sequence. A mistake is when the system has the right information or procedure but executes it incorrectly. LLMs are prone to hallucination. Symbolic systems are prone to mistakes. Hybrid systems can have both. The distinction matters less than building systems that catch either kind.
Should we ever trust AI systems for life-critical decisions?
Only if you trust them the same way you’d trust a doctor, lawyer, or engineer — with oversight, second opinions, and accountability structures. An AI system that’s 99% accurate sounds good until you realize that 1% of the time it’s making a life-or-death call. Build the institutional safeguards first, deploy the AI second.
How do you know when an AI system is hallucinating versus when it’s extrapolating reasonably?
You don’t, in real time. That’s why human oversight is necessary. The “mother LLM” approach is one way — another system watches and checks. A second way is human observation. A third way is testing against known facts. The key is building multiple layers of error detection rather than hoping a single system is always right.
What’s the most dangerous kind of hallucination?
When the AI is confidently wrong about something that matters, in a way that seems plausible to humans. A robot confidently stating the wrong medication name. An autonomous vehicle confidently making an unsafe lane change. A hiring algorithm confidently filtering out qualified candidates. These are dangerous because the confidence triggers human trust, but the content is false. Building skepticism into human operators is important.
How do you scale human oversight if millions of robots are deployed?
You don’t scale 1:1. Instead, you design systems where humans intervene only when the AI’s confidence drops below a threshold or when certain edge cases occur. You use statistics and logging to find systemic issues. You build communities where users report problems. This is how Tesla handles Autopilot — millions of cars, but centralized data collection and periodic over-the-air corrections based on what went wrong.
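The threshold-based routing described in that answer is simple to express. Below is a hedged sketch: the threshold value and the `route` function are illustrative assumptions, and real deployments would calibrate the confidence score and tune the cutoff per task.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; real systems tune this per task

def route(decision: str, confidence: float, human_queue: list) -> str:
    """Let the robot act when confident; queue uncertain cases for review.

    A toy sketch of 1:N oversight — many robots share one human_queue,
    so a single operator only sees the cases the AI itself flags.
    """
    if confidence >= CONFIDENCE_THRESHOLD:
        return "autonomous"
    human_queue.append(decision)  # one operator serves many robots
    return "escalated"
```

Because only low-confidence decisions reach the queue, the human-to-robot ratio scales with the error rate rather than the fleet size — which is exactly why one person can plausibly monitor a hundred robots instead of ten.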
Will AI systems ever be more reliable than humans at decision-making?
In narrow tasks, yes, they already are. In broad, contextual decision-making with uncertainty, probably not in the next decade. But the question frames it wrong — you’re not replacing humans with AI. You’re building systems where AI does what it’s good at (pattern recognition, consistency) and humans do what they’re good at (judgment, context, ethics). The goal is the hybrid, not the pure AI.
Is Liphardt saying we should just accept that AI hallucinations will happen?
He’s saying we should design systems assuming hallucinations will happen. Accept the reality of the limitation. Build oversight, correction mechanisms, and fallbacks. Don’t try to engineer away hallucination — that’s not possible. Engineer around it instead.
Full episode coming soon
This conversation with Jan Liphardt is on its way. Check out other episodes in the meantime.
Visit the Channel