Why AI-Generated Code Is Just Bytecode

The vibe-coding debate has settled into two camps. Camp One says AI-generated code is fine for prototypes but dangerous in production — it’s messy, ignores best practices, doesn’t respect clean code principles, and produces architectural debt that engineers will have to pay down later. Camp Two says vibe coding is the future and engineers who fight it will be left behind.

Paul Iusztin, the founding AI engineer at a vertical AI startup and author of LLM Engineer’s Handbook (~20,000 copies sold), came down hard in Camp Two during our recording — but with a reframe that reorients the whole debate. The reframe is short, technical, and almost unanswerable once you hear it. “AI-generated code is just another compilation step. We don’t read bytecode. Why are we reading this?”

The argument isn’t dismissive of code quality. It’s about who code quality is for.

What clean code was actually for

Clean code conventions, design patterns, architectural rigor: these existed for a specific reason. Human engineers needed to read code their colleagues wrote, debug code at 2am, and maintain code six years after the original author left the company. Conventions weren’t about elegance. They were about reducing the cognitive load humans face when scanning hundreds of files.

“Many people complain about vibe coding that it creates messy code or code that doesn’t respect clean code or good architectural designs,” Paul explains. “But these designs were made for people, to avoid human error. And maybe the AI will not need them anymore.”

The argument lands because it inverts the assumption. Clean code wasn’t an absolute good. It was a workaround for a specific bottleneck — human comprehension at scale. If the bottleneck changes, the workaround should change with it.

The bytecode parallel

Paul’s analogy is what makes the reframe stick. Python, like most high-level languages, passes through multiple stages before execution. In CPython, source code is parsed into an abstract syntax tree and compiled to bytecode, which the interpreter then executes; other toolchains go one step further and emit machine instructions. Engineers don’t read the bytecode. They don’t audit the AST. They write source code and trust the pipeline.

“Now we write Python that’s compiled many, many times until it reaches bytecode. Do we read the bytecode to make sure it’s clean bytecode? No, we just trust it,” Paul says. “AI-generated code — just a compilation step.”
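The stages in this analogy can be inspected directly. The sketch below is a minimal illustration, assuming CPython and its standard ast and dis modules; the add function is a made-up example, and the exact opcode names it prints vary between Python versions.

```python
import ast
import dis

# A made-up function, used only to illustrate the pipeline.
source = "def add(a, b):\n    return a + b\n"

# Stage 1: source text is parsed into an abstract syntax tree.
tree = ast.parse(source)
func_def = tree.body[0]
print(type(func_def).__name__, func_def.name)

# Stage 2: the tree is compiled into a code object that holds bytecode.
code_obj = compile(tree, filename="<example>", mode="exec")

# Stage 3: the interpreter executes the bytecode, which defines `add`.
namespace = {}
exec(code_obj, namespace)
add = namespace["add"]

# The bytecode is there if anyone wants to read it; few engineers ever do.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)

# What engineers actually check is the behavior.
print(add(2, 3))
```

Nothing in day-to-day Python work requires looking at the opcode list; the check that matters is the final call.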

In this framing, AI is another layer in the compilation stack. The engineer’s intent is the source. The AI translates intent into a runnable artifact. Engineers don’t need to audit the artifact line-by-line for the same reason they don’t audit bytecode line-by-line: the value the engineer adds isn’t in inspecting the output, it’s in specifying the input.

Angelina extended the metaphor during the recording: “We have to accept the younger generation speaking a different language and then we just work with it.”

What replaces line-by-line review

If reading AI-generated code line-by-line is the wrong abstraction layer for engineering effort, what replaces it? Paul’s answer is direct: AI evals.

“We need a way to ensure at the next level that everything works fine. AI evals — basically ensures from a data point of view that your product works fine. You don’t even need to write the classic unit and integration tests. You just from a data approach based on specific inputs and outputs ensure that your system works as expected.”

The shift is from code-level testing to behavior-level testing. Instead of asking “is this function written correctly?”, you ask “does this system produce the right outputs given the right inputs across the distribution of cases we care about?” The first question gets more expensive to answer as AI generates more of the code, because there is more code to read. The second costs the same to answer regardless of who wrote it.
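A behavior-level check of this kind can be sketched in a few lines. This is a minimal illustration, not Paul’s tooling: EvalCase, run_evals, and generated_slugify are hypothetical names, and generated_slugify stands in for any AI-generated function whose body nobody has read.

```python
import re
from dataclasses import dataclass
from typing import Callable


def generated_slugify(text: str) -> str:
    # Hypothetical stand-in for AI-generated code; the eval never inspects it.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


@dataclass
class EvalCase:
    name: str
    input: str
    check: Callable[[str], bool]  # behavioral expectation on the output


def run_evals(system, cases):
    """Return (pass_rate, failures); each failure keeps the input that caused it."""
    failures = []
    for case in cases:
        try:
            output = system(case.input)
            ok = case.check(output)
        except Exception as exc:  # a crash counts as a behavioral failure
            output, ok = repr(exc), False
        if not ok:
            failures.append((case.name, case.input, output))
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return pass_rate, failures


cases = [
    EvalCase("punctuation", "Hello, World!", lambda out: out == "hello-world"),
    EvalCase("whitespace", "  spaces  ", lambda out: out == "spaces"),
    EvalCase("empty", "", lambda out: out == ""),
    EvalCase("accents", "Ünïcode", lambda out: out == "unicode"),
]

pass_rate, failures = run_evals(generated_slugify, cases)
print(f"pass rate: {pass_rate:.2f}")
for name, given, got in failures:
    print(f"FAILED {name}: input={given!r} output={got!r}")
```

Here the accents case fails, and the report points at the input that triggered it rather than at a line of code. The fix is then a sharper spec (how should accented characters be handled?), not a style review.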

This is also why Paul thinks AI evals will be the next major engineering discipline. “If AI can code, then the only thing you can do is to evaluate it.”

In his Agentic AI Engineering course, eval design is its own module — not because it’s a niche topic, but because it’s the new core skill. Evals replace much of what traditional unit and integration testing used to cover, with one critical difference: AI systems aren’t deterministic, so you can’t compare outputs character-by-character. Two different responses might mean the same thing. Evaluation requires designing custom business metrics calibrated to domain experts, not generic “hallucination scores” that don’t tell engineers what to fix.
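To make the point about non-determinism concrete, here is a minimal sketch of a custom metric that scores what a response says rather than its exact wording. Everything in it is an assumption for illustration: the 30-day refund policy, the sample responses, and the function names states_refund_window and pass_rate. A real metric would be calibrated with domain experts.

```python
import re


def normalize(text: str) -> str:
    # Collapse whitespace and case so formatting differences do not matter.
    return re.sub(r"\s+", " ", text.lower()).strip()


def states_refund_window(response: str, days: int = 30) -> bool:
    # Hypothetical business rule: the answer must state the refund window.
    return re.search(rf"\b{days}[- ]days?\b", normalize(response)) is not None


def pass_rate(responses, metric) -> float:
    # Score a batch of sampled responses instead of a single string.
    return sum(metric(r) for r in responses) / len(responses)


# Made-up samples standing in for repeated runs of the same prompt.
samples = [
    "You can return the item within 30 days of delivery.",
    "Refunds are available during our 30-day window.",
    "We accept returns at any time.",
]

# Character-by-character comparison treats the first two as different...
print(samples[0] == samples[1])

# ...while the business metric treats them as equivalent and flags the third.
print([states_refund_window(s) for s in samples])
print(round(pass_rate(samples, states_refund_window), 2))
```

A failing response here tells the engineer exactly what is wrong (the refund window is missing), which a generic aggregate score would not.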

Where this goes wrong

The reframe has limits. It works when the AI is operating inside a specific layer of the stack — generating UI components, writing CRUD endpoints, scaffolding tests, handling boilerplate. It doesn’t work when the AI is making architectural decisions, choosing data models, or designing system flows. Those still require human judgment, and not because of style — because the consequences of being wrong are different.

Paul is explicit about this. The engineer’s job isn’t going away. It’s relocating. “You should probably still decide how your system looks like, how your data models look like, how your data flows, how everything scales, the input output, the business problem that you solve. You’re still the one in charge that solves the problem, but you’re not necessarily the one in charge on how the code looks behind the scenes.”

The split is clean: engineers own what the system does. AI handles how the code expresses it. Engineers evaluate the behavior against the requirement. Nobody reads the bytecode.

What this means for engineers right now

If Paul is right, the engineering skills that compound from here aren’t pattern-matching cleaner code. They’re system design, eval design, and judgment about what to build. The engineers who’ll be most valuable are those who can:

Specify clearly. Tell AI what to build with enough precision that the output is correct, and recognize when the spec was the problem rather than the implementation.

Design evals. Build measurement systems that catch behavioral bugs in non-deterministic systems — including custom business metrics, not generic hallucination scores.

Hold architectural calls. Make the data-flow, scaling, and contract decisions that determine whether the system can survive its second year.

The engineers who will struggle are those whose skill set was largely formatting code to match team conventions. That work is being compiled away.

FAQ

Should engineers stop reviewing AI-generated code?

Not entirely — but stop reviewing it line-by-line for style and conventions. Review for behavior, system design, and edge cases. Treat AI-generated code like compiled output: trust the translation if the inputs and outputs are correct, audit the spec if the behavior is wrong. Code-level review is the wrong abstraction layer when AI handles the implementation.

What is the bytecode metaphor for AI code?

Python source passes through multiple stages (an AST, then bytecode that the interpreter executes), and engineers don’t read each stage. They write source and trust the pipeline. Paul Iusztin argues AI generation is another layer in this pipeline. Engineers should specify intent and evaluate behavior, not audit the AI-generated code as if a human wrote it.

How should engineers evaluate AI-generated systems?

Through AI evals — input-output testing against the distribution of cases the system needs to handle. Custom business metrics calibrated to domain experts replace generic metrics like “hallucination scores.” Eval design includes offline benchmarks, online monitoring, and feedback loops that prioritize what to fix when AI gets things wrong.

What does Paul Iusztin say about clean code in the AI era?

Clean code conventions existed to reduce cognitive load for humans reading and maintaining code. AI doesn’t have that bottleneck. Conventions designed to prevent human error may not apply when AI writes the code. Engineers should evaluate behavior, not style — treating AI-generated code as a compilation step rather than as source for human review.

Will engineers still be needed if AI writes the code?

Yes — but the work shifts up a layer. Engineers own system design, data model decisions, scaling tradeoffs, eval design, and architectural judgment. AI handles boilerplate, scaffolding, CRUD code, and pattern-matching implementations. The valuable engineers are those who can specify clearly, design evals, and hold architectural calls — not those who format code to match conventions.

What’s wrong with current AI evals?

Generic metrics like “hallucination scores” don’t tell engineers what to fix. A score of 3.8 on hallucination is meaningless without knowing which specific behaviors triggered it. Effective evals require custom business metrics calibrated to domain experts, plus the ability to trace failures back to the input that caused them. Most AI eval tooling is too generic.

Should I learn vibe coding tools as a senior engineer?

Yes — Paul Iusztin describes himself as bullish on vibe coding even as a senior AI engineer. The leverage compounds: tools like Cursor and Claude Code extend skills into adjacent domains (front-end, design) without requiring full mastery. The engineers who avoid vibe coding will spend time on work that AI handles faster, while engineers who adopt it move up the value stack.

How does vibe coding work for production code?

It works when the AI generates code inside a well-specified scope — UI components, CRUD endpoints, scaffolding, boilerplate. It doesn’t work for architectural decisions or data-model design. The reliability question is solved through evals (does it behave correctly?) rather than code review (does it look correct?). The split: humans own what to build, AI handles how the code expresses it.
