Screenshots Beat HTML: Why Web Agents Need to See Like Humans

When Yutori’s engineering team started building their web automation agent, the obvious approach was to parse the DOM. Feed the model the HTML, let it reason about the structure, and decide what to click. It’s what every textbook says: machines should read machine-readable code, not visual renderings meant for human eyes.

They tried it. It didn’t work at scale.

“When we first got started, our thinking was that the rendering of the website is very much intended for a human to consume it,” Devi Parikh explains. “But behind it is this machine code — the HTML, the DOM, all of that information. So it makes sense for the machine to just use that directly. Why use this rendering?”

The logic was sound. The assumption was wrong.

The Noise Problem

Different websites are built in radically different ways. Some use semantic HTML. Some are built on legacy systems with decades of accumulated code. Some are modern single-page applications with JavaScript rendering everything. Some have accessibility overlays. Some have hidden elements that never render. Some have placeholder divs. Some have deprecated code nobody has bothered to clean up.

When you feed raw HTML to a model and ask it to navigate, it’s swimming in noise. “Different websites are built in such different ways that there is so much noise if you take that raw HTML DOM information that it’s very hard to train models to be reliable enough,” Parikh says.

The model has to learn not just what elements are clickable, but which elements are actually visible. Not just what text is in the DOM, but which text a user can actually see. Not just what structure the page has, but which structure matters. It’s like asking someone to navigate New York using an unedited architectural blueprint instead of looking at the street.

“Your context also blows up. You’re putting in these massive numbers of tokens to get to that,” Parikh adds. Process the screenshot? A few thousand tokens. Process the raw HTML of a complex page? Potentially 20,000-50,000 tokens. Multiply that by every step an agent takes, every page it navigates, every scout running for every user.
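
To make that multiplication concrete, here is a rough back-of-the-envelope sketch. The 3,000-token screenshot figure and the 30-step run are assumptions chosen for illustration; only the 20,000-50,000 range for raw HTML comes from the figures above. Real numbers depend on the model, the image resolution, and the page.

```python
def tokens_per_run(tokens_per_step: int, steps: int) -> int:
    """Total input tokens if the agent re-reads the page at every step."""
    return tokens_per_step * steps


# Illustrative numbers only; none of these are measured values.
SCREENSHOT_TOKENS = 3_000   # assumed stand-in for "a few thousand"
HTML_TOKENS_LOW = 20_000    # low end of the raw-HTML range quoted above
HTML_TOKENS_HIGH = 50_000   # high end of the raw-HTML range quoted above
STEPS = 30                  # assumed number of steps in one agent run

screenshot_total = tokens_per_run(SCREENSHOT_TOKENS, STEPS)
html_low_total = tokens_per_run(HTML_TOKENS_LOW, STEPS)
html_high_total = tokens_per_run(HTML_TOKENS_HIGH, STEPS)

print(screenshot_total)  # 90000
print(html_low_total)    # 600000
print(html_high_total)   # 1500000
```

Under these assumptions, a single run on raw HTML consumes roughly 7 to 17 times the context of the same run on screenshots, before multiplying by users and scouts.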

The Generality Problem

Even if you solved the noise problem, you’d face a worse one: scaling to arbitrary websites. You can write parsing rules for a specific site — “on this site, buttons are always in this CSS class, forms always have this structure.” That works beautifully. You build a clean pipeline, the model can be lightweight, latency is fast, costs are minimal.

But Yutori’s product isn’t built for one website. It’s built for any website. Scouts monitor whatever the user wants — a specific product price, apartment availability in a city they’re moving to, news about a startup, competitor reviews on niche platforms. The websites are random. The structures are unpredictable.

“If you wanted to build this for any one website, it would make a lot of sense to do it this way,” Parikh explains. “But if you’re trying to build web agents that are general, that can interact with websites just across the entire web, this is just not going to be scalable over time. The way in which you parse and sort of clean this up becomes so website-specific that you just can’t scale this.”

You’d need a custom parser for Airbnb, another for concert ticket sites, another for Reddit, another for product pricing pages. Hundreds of sites. The engineering cost is prohibitive. The fragility is unbearable.

Screenshot-based agents have a single interface: what the user sees. That interface is the same whether the underlying HTML is beautiful or a disaster. “What we found is that just using the screenshot is what gives you that generality.”

The Scaling Implication

This choice cascades through the entire architecture. Screenshots are more expensive to process than text. A screenshot of a webpage is thousands of pixels. Encoding that into tokens, passing it to a model, getting a decision back — it costs more per step than reading HTML and generating coordinates.

But the cost is paid once per step, and the generality is infinite. One screenshot-based agent can handle Netflix and a local boutique’s booking system and a government website with broken styling and Reddit and LinkedIn and everything in between. You don’t maintain 100 custom parsers. You maintain one visual-understanding system.

The larger implication is that teams across the AI research world converged on this approach independently. “I think there are teams who over time have also figured this out,” Parikh notes. “And I see the larger labs probably have already arrived at similar conclusions as we have.” When different organizations solving the same problem arrive at the same solution, it’s a signal that you’ve hit an architectural truth.

The trade-offs are clear: higher token cost per action, but general-purpose capability. For a product like Scouts, which monitors diverse websites and runs continuously for many users, the economics only work because you offset the per-action cost with a custom in-house model that’s cheaper to run than a foundation model.

Why Rendering Doesn’t Fully Solve It

You might think: well, what if we render the HTML to completion (using something like Selenium), then parse the rendered result? That way we get the visual output plus the cleaned-up structure.

“But even if you render the page, the underlying code is still very messy,” Parikh explains. “There’s all sorts of things. There’s a whole bunch of hidden elements that are being described in the HTML that are not being rendered. And your model is now going to have to figure out that this stuff that’s being described is actually not even available. It’s not even something that you can interact with.”

Hidden elements. Modal overlays that cover other content. Elements that are visually stacked but appear in the DOM in a different order. Responsive designs that hide elements on mobile. Skeleton loaders that haven’t finished rendering. All of this is still noise in the HTML even after rendering.

The screenshot short-circuits all of it. What you see is what you can interact with. If an element isn’t in the screenshot, the model doesn’t have to reason about whether it’s hidden or just not rendered. It’s simply not in the frame.
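
A small sketch of the hidden-element problem, using only Python's standard-library HTML parser. The sample page and the `TextCollector` class are made up for illustration. The filter only understands the inline `hidden` attribute and inline `display:none`; it cannot see elements hidden by external stylesheets, scripts, overlays, or off-screen positioning, which is exactly the kind of site-specific gap described above.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TextCollector(HTMLParser):
    """Collects text nodes, optionally skipping inline-hidden subtrees."""

    def __init__(self, skip_hidden: bool) -> None:
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks: list[str] = []

    @staticmethod
    def _is_hidden(attrs) -> bool:
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        return "hidden" in attr_map or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.skip_hidden and self.hidden_depth):
            self.chunks.append(text)


def extract_text(html: str, skip_hidden: bool) -> list[str]:
    parser = TextCollector(skip_hidden)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: two of the four text nodes are never shown to a user.
PAGE = """
<div>
  <button>Book now</button>
  <div style="display: none"><button>Legacy checkout</button></div>
  <div hidden><a href="/old">Old promo link</a></div>
  <p>Rooms from $120</p>
</div>
"""

raw = extract_text(PAGE, skip_hidden=False)
visible = extract_text(PAGE, skip_hidden=True)
print(raw)      # ['Book now', 'Legacy checkout', 'Old promo link', 'Rooms from $120']
print(visible)  # ['Book now', 'Rooms from $120']
```

A model reading the raw DOM sees four candidate pieces of content, including a second button that cannot be clicked. A screenshot of the same page would show only the two visible ones, with no filtering rules to maintain.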

FAQ

Doesn’t Claude’s browser extension use a similar approach?

Yes. Large AI labs discovered the same thing independently. Claude’s computer-use model takes screenshots, decides on an action, executes it, and takes the next screenshot. That’s the state of the art for general-purpose web automation. The industry has converged on screenshots as the standard interface.
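
The screenshot, decide, execute cycle can be sketched as a simple loop. This is a minimal illustration, not Claude's or Yutori's actual implementation: `run_agent`, `Action`, and the three callables are hypothetical names, and the "model" below is a hard-coded stand-in so the loop can run without a browser.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str   # "click" or "done" in this sketch
    x: int = 0
    y: int = 0


def run_agent(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes, str], Action],
    execute: Callable[[Action], None],
    goal: str,
    max_steps: int = 10,
) -> int:
    """Screenshot -> decide -> execute, repeated. Returns actions executed."""
    executed = 0
    for _ in range(max_steps):
        screenshot = take_screenshot()      # fresh view of the page
        action = decide(screenshot, goal)   # a vision model in a real system
        if action.kind == "done":
            break
        execute(action)
        executed += 1
    return executed


# Toy stand-ins: three fake "screens" and a rule-based decision function.
screens = [b"home page", b"search results", b"listing page"]
state = {"index": 0}
log: list[Action] = []


def fake_screenshot() -> bytes:
    return screens[state["index"]]


def fake_decide(screenshot: bytes, goal: str) -> Action:
    if screenshot == b"listing page":
        return Action("done")
    return Action("click", x=100, y=200)


def fake_execute(action: Action) -> None:
    log.append(action)
    state["index"] += 1


steps = run_agent(fake_screenshot, fake_decide, fake_execute, goal="open the listing")
print(steps)  # 2
```

The loop never inspects the DOM. Its only input is whatever the screenshot function returns, which is why the same loop can in principle run against any site.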

What about APIs? Aren’t they more efficient?

Yes. If a website has an API, using it is faster and cheaper. Yutori uses APIs for flights, LinkedIn, Reddit, and dozens of other platforms. But most websites don’t expose their data via API. Scouts uses screenshot-based navigation for those sites. APIs for the willing, screenshots for the rest.
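
One way to picture "APIs for the willing, screenshots for the rest" is a small routing function. This is a hedged sketch: the host names and the `API_HOSTS` registry are invented for illustration and say nothing about how Yutori actually routes requests.

```python
from urllib.parse import urlparse

# Hypothetical registry of hosts with a usable API integration.
API_HOSTS = {"flights.example.com", "forum.example.com"}


def choose_strategy(url: str) -> str:
    """Return "api" when an integration exists for the host, else "screenshot"."""
    host = urlparse(url).hostname or ""
    return "api" if host in API_HOSTS else "screenshot"


print(choose_strategy("https://flights.example.com/search?from=SFO"))  # api
print(choose_strategy("https://local-boutique.example.org/book"))      # screenshot
```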

Does this mean every web agent needs to be expensive?

Only if it uses a foundation model for every action. Yutori trained a custom in-house model specifically for web navigation. Custom models are expensive to build but much cheaper to run at scale. For a product that runs thousands of agent steps per day, the investment pays off quickly.

Can you cache screenshots to reduce tokens?

Partially. You can store screenshots between runs and compare changes. But each action still requires a fresh screenshot and a model inference to decide the next step. You can optimize at the margins, but you can’t escape the fundamental cost of visual processing.
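
A minimal sketch of the "store and compare" idea, assuming screenshots are available as raw bytes. `ScreenshotCache` is a hypothetical helper. Exact hashing treats any pixel change (a rotating ad, a timestamp) as a change, so a real system would likely need a more tolerant comparison.

```python
import hashlib


class ScreenshotCache:
    """Remembers the last screenshot digest per URL across monitoring runs."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def has_changed(self, url: str, screenshot: bytes) -> bool:
        """True if this screenshot differs from the last one seen for the URL."""
        digest = hashlib.sha256(screenshot).hexdigest()
        changed = self._digests.get(url) != digest
        self._digests[url] = digest
        return changed


cache = ScreenshotCache()
url = "https://listings.example.com/apartments"  # made-up URL

first = cache.has_changed(url, b"pixels-v1")   # nothing cached yet
second = cache.has_changed(url, b"pixels-v1")  # identical bytes
third = cache.has_changed(url, b"pixels-v2")   # page content changed

print(first, second, third)  # True False True
```

This only lets a monitor skip model calls when a page is byte-for-byte unchanged between runs. Within a navigation, every action still needs a fresh screenshot and an inference.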

What if JavaScript rendering is broken on a page?

That’s a problem regardless of whether you’re using HTML or screenshots. A JavaScript error that breaks the page breaks it for the agent too. But the agent can still fall back to what is visible. A broken DOM is harder to parse than a broken screenshot.

How does the model learn what’s clickable from a screenshot?

It learns from training data. You show the model many pairs of (screenshot, action). The model learns to recognize buttons, links, form fields, and other interactive elements from their visual appearance. This is why pre-training on human web-use data is critical — the model learns human conventions for interface design.

What about inaccessible websites with poor semantic HTML?

Screenshot-based agents actually handle these better than DOM-based agents. Poor semantic HTML is still poor HTML — the noise problem gets worse. But if the website is visually navigable by a human, it’s navigable by an agent that sees screenshots. The rendering layer flattens the structural chaos.

Does this approach work on mobile sites?

Yes. Mobile sites are still rendered in the browser. The agent takes a screenshot of whatever size the viewport is (mobile, tablet, desktop) and navigates accordingly. Yutori can monitor mobile sites just as effectively as desktop.

What’s the latency penalty for processing screenshots?

It depends on the model, but processing a screenshot typically adds 1-3 seconds per action compared to pure HTML parsing. For monitoring products where actions happen every few minutes or hours, that overhead is negligible. For systems that need sub-second response times, this approach wouldn’t work. But for most web automation tasks, the latency is acceptable.

Will this ever change? Could HTML parsing become viable again?

It would require widespread adoption of web standards that make HTML structurally cleaner and more consistent. That’s not happening. The web is messier and more diverse than it’s ever been. Screenshot-based agents are likely to remain the standard for general-purpose automation for the foreseeable future.
