How to Stop LLM Agents From Getting Stuck in Loops
Magnus Müller, CEO at Browser Use
The scariest moment in building with LLMs is when an agent gets stuck. It tries to click a button, the click fails, it tries again. And again. And again. Your token budget burns. The task never completes. You watch $100 disappear while the agent remains convinced it’s making progress.
This is the hallucination loop: an LLM fails to recognize that its action failed and keeps repeating the same thing in slightly different ways, convinced it will eventually work.
Magnus Müller has debugged hundreds of these. He’s clear about what causes them and what actually stops them.
What Triggers a Loop
Different models behave differently when they hit an error. Some models see a captcha and think: “I should try harder. Maybe I didn’t click in the right spot.” They try clicking the same captcha from different angles. The click still fails. They try again.
“Some LLMs, when they see a captcha and they cannot solve it, they would just try forever. They would try until you run out of money,” Magnus explains.
Other models are more pragmatic. They see the captcha and think: “I’m not allowed to solve this. I need to stop here.” These models break immediately and ask for help. They fail, but quickly.
The first model is dangerous. It will burn through your token budget. The second is safer but still problematic—it might give up too easily on tasks that are actually solvable.
The Root Cause Matters
When Magnus’s team sees a stuck agent, they don’t just throw more prompt engineering at it. They diagnose: what’s the root cause of the loop?
“Most often it happens, oh, it tried to click, but the click didn’t work. Oh, then it’s much better to fix the click action than to actually fix a loop.”
This is the key insight: looping is often a symptom, not the disease. If the agent is trying to click the same button repeatedly and it’s not working, the real problem might be:
- The click coordinates are wrong — the element moved, or the selector changed
- The click isn’t being registered — the page requires a delay before interaction
- The element isn’t visible — it’s behind a modal or in a collapsed section
- The page state changed — the button disappeared, or a different page loaded
Fixing the loop by adding instructions (“Try 3 times then stop”) treats the symptom. Fixing the root cause (improve click accuracy, add delays, check element visibility) treats the disease.
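Treating the disease means verifying each click against the actual page state instead of trusting the model's belief that it worked. A minimal sketch in Python; the `page` object and its `html()`/`click()` methods are hypothetical stand-ins, not any real browser API:

```python
import hashlib
import time

def dom_hash(page_html: str) -> str:
    """Fingerprint the page so we can tell whether anything changed."""
    return hashlib.sha256(page_html.encode()).hexdigest()

def click_and_verify(page, selector: str, settle_delay: float = 0.5) -> bool:
    """Click, wait for the page to settle, then compare state fingerprints.

    Returns True only if the click observably changed the page, which
    catches wrong coordinates, stale selectors, and hidden elements alike.
    """
    before = dom_hash(page.html())
    page.click(selector)
    time.sleep(settle_delay)  # some pages need a beat before reacting
    after = dom_hash(page.html())
    return after != before
```

A `False` return here is a ground-truth failure signal the agent cannot hallucinate away, which is what makes the retry rules below enforceable.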
Practical Loop Detection Rules
Magnus’s team built specific rules to catch loops before they waste tokens:
Rule 1: Same action, no change
“If you try like your approach three times, but nothing changes, try a different approach.”
The agent tries to click button X. The state doesn’t change. It tries again. Still no change. After three identical attempts, the rule triggers: “Stop. This approach isn’t working. Try something else.”
This is simple but effective. It catches the most common loop immediately.
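The rule can be implemented as a small counter keyed on the (action, state) pair. The class below is an illustrative sketch, not code from the Browser Use repository:

```python
class SameActionDetector:
    """Trigger when the same action repeats N times without a state change."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.last = None   # (action, state_fingerprint) of the previous step
        self.count = 0

    def record(self, action: str, state_fingerprint: str) -> bool:
        """Record one step; return True when the agent should change strategy."""
        key = (action, state_fingerprint)
        if key == self.last:
            self.count += 1
        else:
            self.last, self.count = key, 1
        return self.count >= self.max_attempts
```

Note that a changing state fingerprint resets the counter, so slow but real progress never trips the rule.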
Rule 2: Repeated tool calls without progress
The agent calls the same tool (click, scroll, type) with identical or near-identical parameters multiple times. This suggests the action isn’t working, not that the agent is strategically retrying.
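"Near-identical" is the tricky part: a retry at coordinates (412, 305) versus (414, 303) is the same attempt in disguise. One way to catch it is a similarity ratio over the serialized parameters; the threshold and record format below are illustrative assumptions:

```python
from difflib import SequenceMatcher

def near_identical(call_a: dict, call_b: dict, threshold: float = 0.9) -> bool:
    """Same tool plus highly similar parameters counts as a repeat."""
    if call_a["tool"] != call_b["tool"]:
        return False
    sim = SequenceMatcher(None,
                          repr(sorted(call_a["params"].items())),
                          repr(sorted(call_b["params"].items()))).ratio()
    return sim >= threshold

def count_stalled_calls(history: list[dict]) -> int:
    """Count how many trailing calls are near-duplicates of the latest one."""
    if not history:
        return 0
    latest, run = history[-1], 1
    for call in reversed(history[:-1]):
        if near_identical(call, latest):
            run += 1
        else:
            break
    return run
```

When `count_stalled_calls` crosses the retry limit, the infrastructure can force a strategy change even though no two calls were byte-for-byte identical.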
Rule 3: Escalating confidence with zero results
Some models do something peculiar: they retry and become more confident. “The click definitely worked this time” (it didn’t). This pattern—increasing confidence despite evidence of failure—is a hallucination loop waiting to happen.
System Prompt Design to Prevent Loops
The system prompt can reduce loop probability, but it’s fragile. Magnus is cautious here: “System prompt engineering is like coding, like one single line change can change your entire behavior of the system.”
A prompt that says “Be cautious and ask for help” might make the agent too conservative—it gives up on solvable tasks. A prompt that says “Be persistent” might create loops.
Magnus’s approach is to build loop-breaking into the execution architecture, not the prompt. The prompt describes intentions (“complete the task efficiently”), and the infrastructure enforces constraints (“if an action doesn’t change state 3 times, escalate”).
This decouples the general philosophy (be persistent) from the specific policy (but not infinitely).
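In code, decoupling policy from prompt can look like an execution wrapper that enforces the constraint no matter what the model wants to do next. The escalation mechanism below is an illustrative stand-in, not Browser Use's actual architecture:

```python
class StrategyEscalation(Exception):
    """Raised when the infrastructure forces a change of approach."""

def run_with_loop_guard(agent_step, get_state, max_stalls: int = 3):
    """Drive the agent, but escalate if the environment stops changing.

    agent_step: callable that performs one model-chosen action
                (returns None when the agent declares the task done).
    get_state:  callable returning a fingerprint of the environment.
    """
    stalls, prev_state = 0, get_state()
    while True:
        action = agent_step()
        if action is None:
            return  # task complete
        state = get_state()
        stalls = stalls + 1 if state == prev_state else 0
        prev_state = state
        if stalls >= max_stalls:
            # The prompt says "be persistent"; the infrastructure says
            # "but not infinitely". Hand control to a planner or a human.
            raise StrategyEscalation(f"no state change after {stalls} steps")
```

The model never sees this logic, so a one-line prompt change cannot accidentally disable it.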
The Role of Observability
When a loop happens despite safeguards, the real power is observability. Magnus’s team uses logging and session recording to understand exactly what went wrong.
“For this we use Laminar. It’s like a tool… basically in here I can see exactly what’s our system prompt, what went into the LLM, how does the message history look like.”
They can replay the session and see: where did the agent diverge from success? Was it a click that didn’t register? A page load that took too long? A hidden UI element?
With this visibility, the team can decide: is this a systemic issue (fix the system prompt or the evaluation dataset) or a one-off (accept the failure, move on)?
Different Models, Different Behaviors
The model you use matters enormously. Magnus has observed Claude, GPT-4, and other models in production. They fail differently.
“I can really experience how different LLMs behave.”
Some models are prone to loops when they see obstacles they’re not trained to handle. Others are more resilient. When Magnus’s team evaluates a new model, they run it through edge cases specifically designed to trigger loops: captchas, authentication challenges, ambiguous UI.
A model that loops on 10% of these cases is riskier than one that loops on 2%. This becomes a criterion for model selection—not just accuracy, but failure behavior under stress.
When to Just Let It Fail
Sometimes the right answer isn’t to fix the loop — it’s to accept failure. Magnus encountered a task in his evaluation platform where the user’s request was fundamentally ambiguous: “Find a comment on the Godzilla fandom page.”
Which comment? There are thousands. Without more specificity, any agent will loop or guess. Magnus’s insight: “If I run this task three times, maybe in one time the agent says, please provide login credentials. Another task, maybe the agent goes and tries to create a new account.”
High variance means high uncertainty. The task itself is broken, not the agent.
Building Loop Prevention Into Your System
If you’re building agents, Magnus’s framework applies:
- Log every tool call — you need to see what the agent is doing
- Detect action failure — know when a click didn’t register, not by checking LLM confidence but by checking actual state change
- Enforce retry limits — if the same action fails 3 times, force a strategy change
- Differentiate between models — test your target models against edge cases that trigger loops
- Accept some failures — not every task is solvable; design for graceful degradation
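The first item on that list is the cheapest to implement. A minimal JSONL tool-call log is enough to replay a failed session later; the field names are my own convention, not Browser Use's:

```python
import json
import time

def log_tool_call(path: str, tool: str, params: dict,
                  state_before: str, state_after: str) -> None:
    """Append one structured record per tool call for later replay."""
    record = {
        "ts": time.time(),
        "tool": tool,
        "params": params,
        "state_changed": state_before != state_after,  # the progress signal
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A run of records with `"state_changed": false` is a loop you can spot with `grep`, long before you need a full observability platform.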
The goal isn’t to eliminate loops—that’s impossible with nondeterministic systems. It’s to detect them early, understand why they happen, and fail fast with useful diagnostic information.
FAQ
How do I know if my agent is in a loop vs. actually making progress?
Monitor state changes. If the agent has taken 10 actions and the page state is identical to 10 actions ago, it’s looping. If the page state keeps changing (scrolling to new content, pages loading), it’s exploring, even if slowly. Loops show zero progress. Slow progress still shows change.
Should I limit the total number of agent steps?
Yes. A simple step limit (like max 100 steps for a task) is a coarse but effective brake. Most tasks complete in 5-20 steps. If you hit 100 steps, something is wrong. Combined with the “same action 3 times” rule, step limits are a good failsafe.
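The two brakes compose into a single policy check per step. The numbers below mirror the ones in the answer (100 steps, 3 retries) but are tunable assumptions:

```python
def within_budget(step_count: int, stall_count: int,
                  max_steps: int = 100, max_stalls: int = 3) -> str:
    """Combine a global step limit with the same-action rule.

    Returns "continue", "change_strategy", or "abort".
    """
    if step_count >= max_steps:
        return "abort"            # something is structurally wrong
    if stall_count >= max_stalls:
        return "change_strategy"  # this approach is not working
    return "continue"
```

The step limit catches slow, meandering failures that never trip the same-action rule, and the same-action rule catches tight loops long before the step limit would.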
Can I detect loops without detailed logging?
Partially. A timeout-based approach works—if a task doesn’t complete in X minutes, kill it. But you lose diagnostic information. If 10% of your tasks are hitting loops, you want to know why, not just that they timed out. Invest in logging.
Do newer models loop less than older ones?
Not necessarily. Loop risk depends more on how a model handles novel situations than its overall capability. Some advanced models can be overconfident in new domains. Some simpler models are more conservative. Test your specific model against edge cases.
Is this a problem I should solve, or should I use a framework?
It depends on your timeline. Building robust loop detection and recovery takes weeks of engineering. Most teams are better off using Browser Use or similar frameworks that have already solved this. If you have the time, building it teaches you a lot about how LLMs fail.
How much should I allocate to loop detection in my system design?
In the Browser Use architecture, about 30% of the complexity goes to handling failures and edge cases (including loop detection). If you’re building lightweight, maybe 10-15%. But don’t ignore it—loops are the difference between an impressive demo and a broken production system.
What’s the difference between a loop and a legitimate retry?
A legitimate retry is when an agent tries a different action after the first fails. A loop is the same action repeated. Legitimate retries show learning (“that didn’t work, let me try this instead”). Loops show confusion. Your detection rule should enforce this distinction.
Full episode coming soon
This conversation with Magnus Müller is on its way. Check out other episodes in the meantime.
Visit the Channel