Founder Insight

Why Converting Everything to Text Makes AI Systems Actually Debuggable

Jan Liphardt, CEO at OpenMind

When Jan Liphardt shows his OM1 architecture to computer scientists, their first reaction is usually incredulous. Everything flows through natural language? All the vision data, the audio, the sensor streams — you’re converting it all to text?

Yes. And it’s intentional. This design choice, which seems backwards from a pure efficiency standpoint, is actually solving a real problem that most AI systems have refused to acknowledge: they’re black boxes, and when they fail, nobody knows why.

“There’s something absolutely fascinating about how much harder it becomes to understand what’s going on,” Liphardt explained. “When you try to push raw video data straight into a sophisticated model, you quickly run into either hardware or software constraints, or you end up with a system that’s almost impossible to debug or almost impossible to understand what’s going on or why certain things happened.”

The OM1 design makes a different tradeoff: slower decision latency in exchange for extraordinary debuggability and composability.

The Black Box Problem in AI

The allure of end-to-end learning is obvious: pipe raw video into a neural network, let it learn what matters, and trust the output. It works. Neural networks can extract patterns from video faster than hand-crafted features ever could. This approach has powered most of the recent breakthroughs in computer vision.

But there’s a dark side. When an end-to-end system fails, you don’t know why. Did the model misclassify an object? Was the lighting wrong? Did a previous layer make a mistake that cascaded downstream? Is the issue in the data or the model architecture? You can run tests, build interpretability tools, use saliency maps, but you’re always reasoning backwards from a failure.

This is fine for some applications. If your image classifier sometimes misidentifies a dog as a cat, you learn about it when someone posts a complaint on social media. But if a robot is caring for your grandmother and the system fails to recognize she’s fallen, the cost of not knowing why is catastrophic.

Large language model systems have the same problem. You prompt an LLM and it returns an answer. Sometimes the answer is great, sometimes it’s a hallucination. What went wrong? The LLM tokenized the input. It computed probability distributions over its learned representations. It sampled the next token. Somewhere in that process, it chose a path that led to nonsense. Where? Which layer? Which component? You can’t see inside.

Text as a Debugging Interface

The OM1 approach inverts this. Convert vision to text. Convert audio to text. Convert sensor readings to text. Then route that text through multiple specialized models that can see what the other models are doing.

“One of the benefits of having models literally talk to one another using natural language is that you can eavesdrop on what all those models are saying very efficiently and very clearly and very well,” Liphardt said.

This is the key insight: text is humanly readable. When a vision model describes what it sees in natural language (“I see a person lying on the floor, motionless”), a human can read that and immediately verify if it’s right. When a safety model responds (“Alert: fall detected. Calling teleop nurse.”), you can trace the reasoning.

Compare that to trying to debug a neural network by looking at activation patterns across a thousand hidden layers. One approach gives you information you can reason about. The other requires specialized tools and deep learning expertise just to understand what went wrong.
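The "eavesdropping" idea can be made concrete with a minimal sketch. Everything here is illustrative, not OM1's actual API: `TextBus`, the `vision`/`safety` source names, and the handler wiring are assumptions. The point is that every message on the bus is a human-readable string, so the debugging log and the data flow are the same thing.

```python
# Minimal sketch of a text-based data bus (hypothetical names, not OM1's API).
# Every model publishes natural-language messages; the log IS the debug trace.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TextBus:
    log: List[str] = field(default_factory=list)
    subscribers: List[Callable[[str, str], None]] = field(default_factory=list)

    def publish(self, source: str, message: str) -> None:
        # Record every message in plain text before routing it onward.
        self.log.append(f"[{source}] {message}")
        for handler in self.subscribers:
            handler(source, message)

    def subscribe(self, handler: Callable[[str, str], None]) -> None:
        self.subscribers.append(handler)


bus = TextBus()


def safety_model(source: str, message: str) -> None:
    # Downstream model reads the same text a human debugger would read.
    if source == "vision" and "lying on the floor" in message:
        bus.publish("safety", "Alert: fall detected. Calling teleop nurse.")


bus.subscribe(safety_model)
bus.publish("vision", "I see a person lying on the floor, motionless")

for line in bus.log:  # "eavesdropping" is just reading the log
    print(line)
```

A human reading this log can verify each step directly: the vision description either matches reality or it doesn't, and the safety model's reasoning is visible in its own message.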

The trade-off is latency. Converting vision to text, routing through multiple models, and synthesizing action takes longer than an end-to-end system that outputs motion commands directly. But for the use cases OM1 targets — humanoids in homes, schools, hospitals — one-to-two-second decision latency is acceptable.

For other use cases, it’s not. “Imagine you’re building a drone to target a tank. In the last few hundred milliseconds, that targeting loop needs to be much, much, much faster. Or imagine you’re building a ballerina humanoid to balance on her big toe. In that case your loops need to be operating at 500 Hertz,” Liphardt noted. “The decision we’ve made as a company is we don’t do drone targeting, don’t do ballerina humanoids, we don’t do onion chopping.”

This is a crucial design decision: OM1 is optimized for domains where latency matters less than debuggability. That matches reality. Most humanoid robotics applications — education, health companionship, household assistance — don’t need millisecond responses. They need reliable, understandable responses.

Building Guardrails Into the Data Bus

A second advantage of text-based data flow is that you can inject guardrails at any point. Want the robot to never grab something without asking permission? You can audit every text message about intentions and filter unsafe ones. Want to prevent the robot from repeating certain behaviors? You can add a rule that rejects suggestions matching a pattern.
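Because intentions travel as plain text, a guardrail can be an ordinary pattern check applied before any action executes. The sketch below is hypothetical (the rule list, `check_intention`, and the specific patterns are assumptions), but it shows why injection is straightforward: the constraint operates on the same representation the models exchange.

```python
# Hypothetical guardrail sketch: intentions are plain text, so rules are
# plain pattern checks applied before an action is allowed to run.
import re

# Each rule pairs a pattern with a human-readable reason for blocking.
GUARDRAILS = [
    (re.compile(r"\bgrab\b", re.IGNORECASE), "ask permission before grasping"),
]


def check_intention(intention: str):
    """Return (allowed, reason) for a proposed action stated in text."""
    for pattern, reason in GUARDRAILS:
        if pattern.search(intention):
            return (False, reason)  # blocked, with a readable explanation
    return (True, "ok")


print(check_intention("Grab the cup from the table"))
# → (False, 'ask permission before grasping')
print(check_intention("Move to the kitchen"))
# → (True, 'ok')
```

Adding a new rule is appending to a list, and every rejection carries a reason a human can audit, which is exactly the property that is hard to get from a constraint buried inside a network layer.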

With end-to-end neural systems, applying a guardrail is much harder. Where would you inject the constraint? At what layer? The model has learned associations across the entire network. A rule you add to one layer might not propagate correctly to the output.

Text-based systems let you apply rules as first-class constraints. “We’re happy to apply natural language guardrails to different parts of the system. So you augment or prevent certain behaviors,” Liphardt explained.

This is what safety engineers have done for decades in other domains. Commercial aircraft have layers of automation with interlocks and manual overrides. They don’t trust a single system to be safe. They build guardrails into the architecture itself. Text-based robotics can do the same thing more explicitly than end-to-end neural systems.

The Composability Win

There’s a third advantage that matters for long-term maintenance: composability. When you update one specialized model in OM1, the rest of the system continues working without retraining.

“You can update and improve one small part of the system as opposed to having to redo everything,” Liphardt said.

This is the difference between a modular system and a monolithic system. With end-to-end learning, the entire network learns interdependencies. Change the input representation slightly, and the whole system needs retraining. But with OM1, you can swap out the vision model for a better one, and as long as its output text is in the same format, the downstream models don’t care.
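The "same format" contract can be sketched as an interface: downstream code depends only on the text a vision model emits, never on its internals. The class names and the toy `downstream` logic below are invented for illustration, under the assumption that both models honor the same textual output convention.

```python
# Illustrative composability sketch (names invented): two vision models
# honor the same text contract, so downstream code is unchanged by a swap.
from typing import Protocol


class VisionModel(Protocol):
    def describe(self, frame: bytes) -> str: ...


class VisionV1:
    def describe(self, frame: bytes) -> str:
        return "person detected, standing"


class VisionV2:  # improved model, same output convention
    def describe(self, frame: bytes) -> str:
        return "person detected, standing near the doorway"


def downstream(model: VisionModel, frame: bytes) -> str:
    # Downstream only parses text; it never sees model weights or layers.
    text = model.describe(frame)
    return "safe" if "standing" in text else "check"


frame = b""  # stand-in for real camera data
print(downstream(VisionV1(), frame))  # → safe
print(downstream(VisionV2(), frame))  # → safe
```

Swapping `VisionV1` for `VisionV2` requires no retraining anywhere else, which is the localized-change property the passage describes.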

This matters practically because model development is iterative. Your vision model is good at recognizing humans but not animals. You fine-tune it. With end-to-end learning, fine-tuning the vision module might degrade downstream performance in unexpected ways. With text-based routing, you get localized changes.

The Efficiency Cost Is Real

The downside is real: text-based systems are slower and use more compute than optimized end-to-end systems. Every conversion to natural language is overhead. Every model call is latency. In aggregate, OM1 might be 10x slower at the architectural level compared to a specialized end-to-end neural network.

Liphardt doesn’t hide this. He embraces the tradeoff explicitly. OM1 isn’t trying to compete with specialized systems at specialized tasks. It’s trying to create a general-purpose system that can handle a range of tasks without requiring complete retraining.

That’s a different optimization target. And it’s the right target for a robotics OS that multiple companies and researchers will run on diverse hardware.

Why This Matters for AI Transparency

The OM1 architecture is relevant beyond robotics because it points to a broader principle: transparency and debuggability might need to be first-class design goals, even if they cost performance.

Right now, most AI systems are built with performance first, interpretability second. The assumption is: get the system working well, then figure out how to explain it. But as AI systems take on more consequential roles, that order is backwards.

Systems that are built with transparency first might be slower, but they’re trustworthy. You can audit them. You can see where decisions come from. When they fail, you can fix them.

The text-based architecture isn’t optimal for performance. It’s a deliberate choice: accept suboptimal speed in exchange for optimal understandability.

FAQ

Doesn’t converting vision data to text lose information?

Yes, but the question is whether you need that information for the task. For recognizing a person and identifying a fall, a description like “person lying on floor, motionless” is sufficient. For tasks requiring fine-grained visual control (surgery, precise assembly), you’d need end-to-end visual systems. OM1 chooses use cases where text descriptions are adequate.

What happens if the vision model makes a mistake when converting to text?

That mistake propagates downstream, just as errors propagate through neural networks. But the advantage is that you can see it. If the safety model receives “person standing, active movement” when the person is actually lying down, a human monitoring the system can catch the discrepancy. In end-to-end systems, you might not catch the error at all.

How much slower is OM1 compared to a specialized vision system?

Probably 5-10x slower at the architectural level. That means 1-2 second decision latency instead of 100-200 milliseconds. For home robotics and education, that’s acceptable. For applications requiring fast feedback, it’s not. This is a feature, not a bug — it’s the boundary of what OM1 can handle.

If text is so great for debuggability, why don’t all AI systems use it?

Because most AI systems are built in domains where performance matters more than transparency. Image classification. Recommendation systems. Language models. These systems are evaluated on accuracy or speed, not on interpretability. When the domain is safety-critical (robotics, autonomous vehicles, healthcare), the tradeoff tilts toward transparency.

Can you mix end-to-end neural models with text-based architectures?

In theory, yes. In practice, it’s messy. You’d have some modules that are black boxes and some that are transparent. The real benefit of OM1 is that everything uses the same data format. Once you mix paradigms, you lose that coherence.

What if you use better vision models that hallucinate less?

Then your text descriptions are more accurate, which helps. But you still face the composability challenge — if you swap in a better vision model, downstream models might need retraining. The text-based approach doesn’t guarantee perfection. It guarantees that when things go wrong, you can see where.

Is this why OpenMind is better at safety than other robotics companies?

It’s one reason. Transparency is only part of the solution. You still need good models, good guardrails, and human oversight. But the architecture makes it much easier to understand what’s happening when something goes wrong. That matters for safety.

Could car companies building humanoids use this architecture?

Probably not for production humanoids. Car companies optimize for cost and speed. The text-based approach is slower and requires more compute. Once a design is finalized, companies might optimize it away. But during development and prototyping, transparency is valuable.

Is there a way to have both speed and transparency?

Not without significant engineering effort. You could build a system that runs fast inference in production but logs all intermediate representations for debugging. That’s expensive. The OM1 choice is simpler: accept the latency cost and get transparency built-in.

What’s the lesson for people building other AI systems?

Think about what you’re optimizing for. If it’s a benchmark number, go fast. If it’s a system someone needs to understand and trust, go transparent. The gap between those goals is the real engineering challenge in AI right now.

Full episode coming soon

This conversation with Jan Liphardt is on its way. Check out other episodes in the meantime.
