Founder Insight

Why Your AI Agent Should Write Code, Not Run in a Loop

Alex Reichenbach, CEO at Structify


A 5% error rate per agent step sounds tolerable. Until you do the math.

Run that 5% across ten steps and you’re at a 40% compounded failure rate. For a one-off chat interaction, 40% is recoverable — the user retries. For a production data pipeline that ingests, joins, transforms, and writes to a downstream system, 40% means the pipeline is unusable. Every multi-agent system that runs probabilistic decisions at every step hits this wall.

Alex Reichenbach, CEO of Structify — the AI data team for enterprises — built his company around the architectural decision that gets you out of it. The decision is simple to state and rare to execute: don’t let the LLM make every decision. Let the LLM write code, and let the code run.

“We are writing code that gets you an answer,” Alex explains. “It means that code is gonna execute the same time every time. So if you’re happy with it the first time, you’re gonna be happy with it again.”

That distinction — between an agent that loops and an agent that codifies — is the architectural fault line in production AI systems right now.

The math problem with multi-agent loops

Multi-agent systems work by chaining LLM calls. Step one: agent decides what to do. Step two: agent decides how to do it. Step three: agent executes. Step four: agent reviews. And so on.

Each step has an error rate. Even at the model frontier, that error rate isn’t zero; call it 5% for a well-prompted, well-tested step. If the steps fail independently, the errors compound multiplicatively: after N steps, the chance that every step was right is 0.95^N. At ten steps that is about 0.60, which leaves a roughly 40% chance that at least one step went wrong and the pipeline produces wrong output.
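The compounding is easy to check numerically. The following is a minimal sketch that assumes every step fails independently and at the same rate; the function name `pipeline_success_rate` is illustrative and not taken from Structify.

```python
def pipeline_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** steps


# Print how success and failure rates move as the chain gets longer.
for steps in (1, 5, 10, 20):
    success = pipeline_success_rate(0.95, steps)
    print(f"{steps:>2} steps: {success:.1%} succeed, {1 - success:.1%} fail")
```

At ten steps this prints a success rate of about 59.9%, which is where the 40% failure figure comes from. By twenty steps, failure is the more likely outcome.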

For chatbots, that’s fine. For agentic tools that write follow-up emails, summarize meetings, or draft documents, the user catches errors before they propagate. For production data pipelines, where errors flow downstream into reports, board decks, customer-facing dashboards, and financial models, 40% is catastrophic.

Tools like Manus, which Alex references explicitly, run this multi-agent architecture. They’re impressive in demos because the demos are short. They struggle in production because production is long.

Code-generation as the architectural escape hatch

The escape hatch isn’t to make the agents more accurate. It’s to remove agents from the execution path entirely.

Structify’s architecture splits the workflow into two fundamentally different phases:

  • The interpretation phase uses an LLM. Given a natural-language request (“find me companies that signed up in the last 24 hours, check if we’ve talked to them in HubSpot, enrich missing ones from the web”), the LLM figures out the strategy. Which APIs to call. Which tables to join. Which transformations to apply.

  • The execution phase is pure code. The LLM doesn’t run the pipeline; it writes the pipeline. The output is a deterministic Polars-based graph (Polars is a DataFrame library with a lazy query engine, comparable in role to Apache Spark’s DataFrame API) that runs the same way every time.
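The two phases can be sketched in a few lines. This is an illustration only, not Structify’s implementation: `interpret` is a hypothetical stand-in for the single LLM call, the plan it returns is hard-coded so the example runs without a model, and plain Python lists take the place of a Polars graph.

```python
from typing import Any, Callable

Row = dict[str, Any]
Pipeline = Callable[[list[Row]], list[Row]]


def interpret(request: str) -> Pipeline:
    """Hypothetical stand-in for the one LLM call that plans the work.

    A real system would generate pipeline code from `request`. Here the
    "generated" plan is hard-coded so the example runs without a model.
    """

    def pipeline(signups: list[Row]) -> list[Row]:
        # Keep signups from the last 24 hours that have no CRM record yet,
        # sorted by company name so the output order is stable.
        pending = [r for r in signups if r["hours_ago"] <= 24 and not r["in_crm"]]
        return sorted(pending, key=lambda r: r["company"])

    return pipeline


# Interpretation phase: happens once.
plan = interpret("companies that signed up in the last 24 hours, not yet in the CRM")

# Execution phase: plain code, no model involved.
signups = [
    {"company": "Globex", "hours_ago": 30, "in_crm": False},
    {"company": "Initech", "hours_ago": 5, "in_crm": True},
    {"company": "Acme", "hours_ago": 3, "in_crm": False},
]
print(plan(signups))
print(plan(signups) == plan(signups))
```

Once `plan` exists, running it again involves no model call, so the only place a probabilistic decision can go wrong is the single call to `interpret`.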

“It’s not a model making the decision,” Alex says. “So you’re separating the two — the code gen is deterministic, carry out the plan, and then the design part is actually the interpretation part, which involves the LLM.”

This is the architectural move that lets Structify guarantee plan-following accuracy. The company doesn’t promise 100% semantic accuracy on the first try (the user might describe the wrong query). What it does guarantee is that the plan, once written, will execute the same way every time. That’s the trust that production work requires.

Why most teams build multi-agent loops anyway

If code-generation is the better architecture, why is the AI agent space mostly multi-agent loops? Three reasons:

First, multi-agent loops are easier to demo. Watching an agent reason through a problem in real time is theatrically compelling. Watching code generate and then execute silently is less so. Founders building toward conference demos optimize for the visual.

Second, code-generation has a higher engineering bar. You need a deterministic execution layer (Polars, Spark, or your own equivalent), a code generation layer that writes correct code consistently, a way to handle errors when generated code fails, and a way to maintain the code over time. That’s significantly more infrastructure than “loop the LLM until it gets to an answer.”

Third, the multi-agent loop pattern matches how engineers think about software. When in doubt, add another step. Code-generation requires designing the system top-down — planning what code should look like, then engineering toward generating it.

The architectural lesson Alex offers: if your AI tool is going to be used in production, where errors compound, you have to design out the loop. That’s not a tooling decision; it’s a foundational architecture decision that has to happen in week one.
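The second reason above lists “a way to handle errors when generated code fails.” One possible shape for that piece is a gate that only accepts generated code after it compiles and reproduces a known answer on a small fixture. The sketch below is an assumption about how such a gate could look, not a description of Structify’s system; `accept_generated_code` and the convention that generated code defines a function called `run` are both hypothetical.

```python
from typing import Callable, Optional


def accept_generated_code(
    source: str, fixture: list[dict], expected: list[dict]
) -> Optional[Callable[[list[dict]], list[dict]]]:
    """Return the generated `run` function if it passes every check, else None."""
    try:
        code = compile(source, "<generated>", "exec")  # reject code that does not parse
    except SyntaxError:
        return None

    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real system would sandbox this.
        exec(code, namespace)
        run = namespace["run"]
        if run(fixture) != expected:  # reject code that gives the wrong answer
            return None
    except Exception:
        return None  # reject code that crashes or never defines `run`
    return run


GOOD = "def run(rows):\n    return [r for r in rows if r['amount'] > 0]\n"
BAD = "def run(rows):\n    return rows[0]['missing']\n"

fixture = [{"amount": 5}, {"amount": -1}]
expected = [{"amount": 5}]

print(accept_generated_code(GOOD, fixture, expected) is not None)
print(accept_generated_code(BAD, fixture, expected) is None)
```

A rejected candidate would typically be sent back to the planning step for another attempt, which keeps the retry inside the interpretation phase rather than the execution path.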

What this means for builders evaluating their own systems

If you’re building or buying an AI tool for production data work, three diagnostic questions separate the architectures:

  • Does it produce reproducible output? Run the same query twice against the same data. If you get different results, an LLM is most likely making decisions during execution. That’s a multi-agent loop. If the results are identical run after run, that is consistent with code executing.

  • Where does the LLM call happen? If the LLM is called for each row, each record, each decision — you have a loop. If the LLM is called once to produce a pipeline that then runs without it — you have code generation.

  • What’s the error budget? Multi-agent loops require an error budget at every step. Code generation requires it only at the planning step. The difference compounds with pipeline length.
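The first diagnostic can be automated by fingerprinting the output of repeated runs. The following is a rough sketch under stated assumptions: `run_query` stands for whatever callable invokes the tool under evaluation, and `stable_tool` and `drifting_tool` are made-up stand-ins for the two architectures.

```python
import hashlib
import itertools
import json
from typing import Callable


def fingerprint(rows: list[dict]) -> str:
    """Stable hash of a result set (keys sorted so dict order doesn't matter)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(run_query: Callable[[str], list[dict]], query: str, runs: int = 2) -> bool:
    """Run the same query several times and compare the output fingerprints."""
    prints = {fingerprint(run_query(query)) for _ in range(runs)}
    return len(prints) == 1


def stable_tool(query: str) -> list[dict]:
    # Stand-in for a tool that executes generated code: same input, same output.
    return [{"query": query, "total": 42}]


_counter = itertools.count()


def drifting_tool(query: str) -> list[dict]:
    # Stand-in for a tool whose answer changes from run to run.
    return [{"query": query, "total": next(_counter)}]


print(is_reproducible(stable_tool, "monthly revenue"))    # True
print(is_reproducible(drifting_tool, "monthly revenue"))  # False
```

The check only means something if the underlying data is held fixed between runs; otherwise a deterministic pipeline will also produce different fingerprints.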

Alex’s framing: “Code, I mean, everything is — there can be errors everywhere, not just if we build it, but if your data team builds it or if an external organization builds it for you. Sometimes data pipelines are not as accurate as you want them to be, and that will always be true. But it is something that we put a lot of work into.”

The errors don’t disappear with code generation. They become bounded — and that’s the difference between a tool that works in production and one that works in demos.

FAQ

What does it mean for an AI agent to “write code instead of looping”?

It means the LLM produces a deterministic code pipeline rather than making decisions at every step. In a loop architecture, the LLM is called for each row, each transformation, each branch, and errors compound across calls. In a code-generation architecture, the LLM is called once to plan the pipeline, then a deterministic engine (like Polars or Spark) executes the code without further LLM involvement.

Why is 5% error per step a 40% failure rate at 10 steps?

Because errors compound multiplicatively. If each step has 95% accuracy and the steps fail independently, the cumulative accuracy across 10 sequential steps is 0.95^10, which is about 0.60. That leaves a roughly 40% chance that at least one step fails. Multi-agent loops hit this wall fast in production. Code-generation avoids it by collapsing N steps of LLM decision-making into one planning step plus deterministic execution.

How does Structify ensure pipeline accuracy?

By separating interpretation from execution. The LLM writes the code that defines the pipeline; the pipeline itself is a deterministic Polars graph that runs the same way every time. Alex Reichenbach says Structify guarantees plan-following accuracy 100% of the time — once a pipeline works, it works the same way on every subsequent run. Functional tests run nightly against hundreds of queries to catch regressions.

What use cases require deterministic AI architecture?

Anything where errors propagate downstream. Financial reporting, board metrics, customer data enrichment, M&A document classification, large-scale web scraping, regulated compliance work — all require reproducibility. Alex describes financial teams getting burned by Excel pipelines passed across people, where one tweak corrupts downstream numbers. Deterministic code pipelines preserve the full lineage and re-run identically.

What’s the cost difference between Structify and a multi-agent loop tool?

Alex says Structify charges per-seat with cost passthrough on LLM and infrastructure usage at no markup. The architecture matters here: code-generation tools call the LLM far less frequently than multi-agent loops, so the underlying compute cost is lower. The customer benefits from both better accuracy and lower marginal compute cost per query.
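The call-frequency claim can be illustrated by counting model calls in each style. This is a toy sketch rather than a measurement of any real product: `CountingLLM` is a hypothetical fake client that does nothing except count, and no prices are assumed.

```python
class CountingLLM:
    """Fake model client that only counts how often it is called (hypothetical)."""

    def __init__(self) -> None:
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return "ok"


def run_as_loop(llm: CountingLLM, rows: list[dict]) -> int:
    """Loop style: ask the model about every row."""
    for row in rows:
        llm.complete(f"Decide what to do with {row}")
    return llm.calls


def run_as_codegen(llm: CountingLLM, rows: list[dict]) -> int:
    """Code-generation style: ask the model once, then process rows in code."""
    llm.complete("Write a pipeline for these rows")
    _ = [dict(row, processed=True) for row in rows]  # deterministic work, no model
    return llm.calls


rows = [{"id": i} for i in range(1_000)]
print(run_as_loop(CountingLLM(), rows))     # 1000
print(run_as_codegen(CountingLLM(), rows))  # 1
```

In the loop style the number of calls grows with the data; in the code-generation style it stays fixed at the planning step, whatever the row count.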

How long does it take to onboard a Structify pilot?

For regulated industries, the first step is the SOC 2 / HIPAA security questionnaire, then a pilot run by a Structify deployment strategist. For smaller teams that want to onboard themselves, Alex says the path is to set up the Slack connector and start interacting in natural language. The platform handles unstructured sources (CSVs, PDFs, ZIP files) and connects to standard data warehouses out of the box.

Why don’t more AI agent platforms use code generation?

Three reasons: it’s harder to demo, it requires significantly more engineering infrastructure (deterministic execution layer plus code-gen layer), and it doesn’t match how most engineers naturally think about agent systems. Multi-agent loops are easier to prototype but harder to make reliable. Code generation is harder to prototype but easier to make production-grade.

What’s the difference between code generation and traditional ETL?

Traditional ETL pipelines are written by engineers, tested manually, and deployed once. They’re deterministic but slow to build and slow to change. Code-generated pipelines are written by an LLM from a natural-language description, tested automatically, and rebuilt instantly when requirements change. The result is a deterministic ETL pipeline produced in minutes instead of months.

Are multi-agent loops ever the right architecture?

For exploratory tasks where the path isn’t predictable in advance — research, planning, debugging — multi-agent loops are valuable. The reasoning trace itself is the value. For production tasks where the same pipeline runs daily or weekly, code generation is the right call. The split is between “I need to figure out what to do” (loop) and “I need to execute the same thing reliably” (code).
