When AI Agent Failures Are Actually Data Problems
Ahmed Rashad, CEO at Perle.ai
Your AI agent isn’t doing well enough. Most teams default to the same response: prompt engineer harder, try a different model, fine-tune. Sometimes that works. More often, it doesn’t, and the team burns weeks pulling on the wrong lever.
Ahmed Rashad — CEO of Perle.ai, where he’s built data labeling pipelines for medical, legal, dental, and embodied AI customers — has watched this pattern play out enough times that he has a specific issue tree for diagnosing it. The headline finding: most agent failures that survive prompt engineering are actually data problems, and the teams that figure that out fastest ship faster.
Here’s the framework, plus what to do when you’re staring at a failing agent and don’t know which knob to turn.
The Default Reflex Is Usually Wrong
When a model isn’t performing, Ahmed observes a consistent pattern: “What I see is a lot of researchers default to, oh, let’s make the algorithms better.”
The instinct is understandable. Algorithm changes feel controllable. You can fine-tune in a weekend. You can swap models in an hour. You can prompt engineer indefinitely. Data work feels intimidating by comparison — slower, less reproducible, harder to scope.
But there’s a cap on how much fine-tuning helps. “Everyone knows when the cap is hit,” Ahmed says, “but we just don’t wanna stop because it’s just so much fun.” The honest version of most agent debugging: teams keep pulling the algorithm lever past the point of diminishing returns because pulling the data lever feels like an admission that the cheap fixes won’t work.
The Issue Tree
Ahmed’s recommendation for any team debugging a non-performing agent is to test multiple levers in parallel before committing to any single one:
- Try prompt engineering — measure the impact
- Try giving the model more good data — measure the impact
- Try improving the quality of existing data — measure the impact
- Try combinations of the above — measure the impact
Then go where the biggest delta is. He frames this as a discipline, not a preference: “I would quickly experiment to figure out where I would get the biggest bang for my buck. I wouldn’t go all in on anything until I figure out where I would get the biggest returns.”
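The "measure, then go where the biggest delta is" step can be sketched in a few lines. This is a minimal illustration, not Ahmed's tooling: the `LeverResult` type, the `rank_levers` helper, and every score below are hypothetical, and it assumes a single eval metric where higher is better.

```python
from dataclasses import dataclass


@dataclass
class LeverResult:
    name: str     # which lever was tried
    score: float  # eval metric after applying the lever (higher is better)


def rank_levers(baseline, results):
    """Return (lever, delta vs. baseline) pairs, largest improvement first."""
    deltas = [(r.name, round(r.score - baseline, 4)) for r in results]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Hypothetical numbers, for illustration only.
baseline_accuracy = 0.71
experiments = [
    LeverResult("prompt engineering", 0.73),
    LeverResult("more data", 0.78),
    LeverResult("better data quality", 0.82),
    LeverResult("prompt + better data quality", 0.84),
]

ranking = rank_levers(baseline_accuracy, experiments)
for name, delta in ranking:
    print(f"{name}: {delta:+.2f}")
```

The point of keeping every lever on the same metric and the same eval set is that the deltas become directly comparable, so the decision about where to invest is made from measurements rather than habit.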
In Ahmed’s experience working with customers across verticals, the highest-ROI lever depends heavily on the domain. For tasks where data is plentiful and high-quality (well-resourced English-language tasks), prompt engineering and architecture changes often win. For tasks in specialized domains where data is scarce or messy (medical, legal, multilingual, embodied AI), data quality work usually dominates.
When the Data Lever Is the One
The signals that point to a data problem rather than an algorithm problem:
- Performance plateaus despite multiple prompt iterations. You’ve tried everything reasonable; the model keeps making the same kinds of errors.
- Errors cluster around edge cases or specific contexts. A model that is 90% accurate overall but fails consistently on the same 10% of inputs suggests the training data didn’t represent that 10%.
- The model performs worse in production than in eval. The eval data was cleaner than the production data, and the gap comes from the variation the model never saw.
- Domain experts review the output and identify systematic mistakes. A clinician spots that the AI is mislabeling penicillin-as-medication versus penicillin-as-allergy. A lawyer notices the contract analysis missed a specific clause type. These are data coverage problems, not model capacity problems.
When these signals show up, more fine-tuning on the same data is unlikely to help, because the training data doesn’t contain enough information about the failure cases.
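One way to check the "errors cluster" signal is to break the error rate down by context and see whether a few slices carry most of the failures. The sketch below assumes each eval record already carries a slice label (for example, the kind of mention being classified). The function names, the `factor` threshold, and the counts are all hypothetical.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """records: iterable of (slice_label, is_correct). Returns {slice: error_rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_label, is_correct in records:
        totals[slice_label] += 1
        if not is_correct:
            errors[slice_label] += 1
    return {s: errors[s] / totals[s] for s in totals}


def concentrated_slices(records, factor=2.0):
    """Slices whose error rate is at least `factor` times the overall error rate."""
    records = list(records)
    overall = sum(1 for _, ok in records if not ok) / len(records)
    rates = error_rate_by_slice(records)
    return sorted(s for s, r in rates.items() if overall > 0 and r >= factor * overall)


# Hypothetical eval results: 90% accurate overall, but the misses are not spread evenly.
records = (
    [("medication", True)] * 48 + [("medication", False)] * 2
    + [("allergy", True)] * 2 + [("allergy", False)] * 8
    + [("other", True)] * 40
)

rates = error_rate_by_slice(records)
print(rates)
print(concentrated_slices(records))
```

If one slice stands out the way "allergy" does here, that is a hint to look at how well that slice is represented and labeled in the training data before reaching for another round of fine-tuning.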
Why Tracing Errors to Source Data Matters
A specific architectural recommendation Ahmed makes for AI agent systems: build feedback loops that trace failures back to source data, so you can improve labels and measure system performance together.
Most teams build their evaluation pipeline as a one-way street. The model produces output, the eval system scores it, the score becomes a number. What’s missing: the connection back to which training examples shaped this particular failure, and the workflow for fixing those examples and retraining.
The pipeline Ahmed describes for high-stakes verticals: an iterative LLM-in-the-loop labeling process where labels aren’t static inputs but get revised through multiple passes. When a production failure surfaces, the workflow can identify the underrepresented data, generate or label more of it, retrain, and measure improvement — all in the same system.
This is the architectural difference between treating data as a one-time setup cost and treating it as a continuous improvement loop. The teams running the loop ship better agents. The teams treating data as setup ship demos that fail in production.
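As a rough illustration of "tracing a failure back to source data", the sketch below looks up the training examples most similar to a failing input so a reviewer can inspect their labels. It uses word-overlap (Jaccard) similarity purely as a stand-in. The source does not say how Perle.ai does this, and a real system might use embeddings or another attribution method. All names and examples are hypothetical.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_training_examples(failure_text, training_set, k=2):
    """training_set: list of (example_id, text, label).

    Returns the k most similar examples as (example_id, label, similarity).
    """
    query = tokens(failure_text)
    scored = [
        (ex_id, label, jaccard(query, tokens(text)))
        for ex_id, text, label in training_set
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]


# Hypothetical training set; "t2" carries a questionable label.
training_set = [
    ("t1", "patient takes penicillin daily", "medication"),
    ("t2", "patient allergic to penicillin", "medication"),
    ("t3", "contract includes indemnity clause", "clause"),
]

failure = "patient is allergic to penicillin"
top = nearest_training_examples(failure, training_set, k=2)
for ex_id, label, similarity in top:
    print(ex_id, label, round(similarity, 2))
```

In this toy case the closest training example is labeled "medication" even though it describes an allergy, which is the kind of finding that turns an eval score into a concrete labeling fix.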
When to Worry About Data Labeling
For teams just starting to build AI agents, there is a useful heuristic for when to worry about data quality: as soon as your agent’s failure modes become visible, run the issue tree. If algorithm changes hit diminishing returns, the data lever is next.
For teams further along (already in production, already failing), the heuristic is different. Ahmed describes two main patterns among the customers who arrive at his door: teams panicking on a Friday night because something launches Monday and the model is failing, and teams that anticipated the problem and reached out months early to architect their data pipeline correctly.
The middle case between those two, calmly addressing data quality before it becomes a launch-blocking problem, is rare. Most teams don’t get there until the failure pattern is undeniable.
If you’re in the early phase, start now. Once your data lever is well-tuned, the algorithm lever does more work, not less. Both compound.
FAQ
How do I know if my AI agent failure is a data problem?
Signals that point to data: performance plateaus despite prompt iteration, errors cluster around edge cases or specific contexts, the model performs worse in production than in eval, and domain experts identify systematic mistakes. When prompt engineering hits diminishing returns and the model keeps making similar errors, the bottleneck is usually data quality or coverage, not model capacity.
What’s the right order of debugging for AI agent performance?
Test multiple levers in parallel before committing: prompt engineering, more data, better data quality, and combinations of these. Measure the impact of each. Ahmed Rashad’s advice: don’t go all in on any single lever until you know where the biggest ROI is. The right lever varies by domain — algorithm changes win in well-resourced domains, data quality wins in specialized verticals.
When should AI agent builders start worrying about data labeling?
As soon as failure modes become visible. If algorithm changes hit diminishing returns, data quality is the next lever. For teams already in production with a failing agent, the urgency is highest: Ahmed Rashad describes panicked Friday-night calls from teams launching Monday with failing models. The middle case (calmly addressing data quality before launch) is rare but ideal.
What does “expert in the loop” actually mean for AI agent systems?
Expert in the loop means domain specialists (clinicians, lawyers, linguists) review and refine training data, especially edge cases and ambiguous examples. Ahmed Rashad’s framing: experts aren’t just labeling, they’re contributing judgment that machines can’t replicate. The system extracts wisdom — variability, contextual interpretation, edge case handling — and bakes it into the training data.
Why does fine-tuning have a cap on improvement?
Fine-tuning works within the model’s existing knowledge structure, adapting it to a specific distribution. It can’t add capabilities the base model lacks, and it can’t compensate for missing training data coverage. Ahmed Rashad notes that “everyone knows when the cap is hit, but we just don’t wanna stop because it’s just so much fun.” Past the cap, additional fine-tuning produces no measurable gains.
What is iterative LLM-in-the-loop labeling?
Iterative LLM-in-the-loop labeling is a pipeline where labels are revised through multiple passes rather than treated as static inputs. An LLM generates initial labels, human experts validate and correct, the corrections feed back into model training, and the process repeats. This treats data labeling as a feedback loop rather than a front-loaded task.
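The loop described above can be sketched as follows. This is a toy under stated assumptions: `propose_label` stands in for an LLM call and `expert_review` for a human expert, and expert corrections are fed back through a simple lookup rather than by retraining a model. None of these names come from the source.

```python
def run_labeling_rounds(items, propose_label, expert_review, max_rounds=3):
    """Revise labels over several passes until experts make no corrections.

    propose_label(item, corrections) -> label      (stand-in for an LLM call)
    expert_review(item, label) -> corrected label  (stand-in for a human expert)
    """
    corrections = {}  # item -> expert-approved label, fed back into later passes
    labels = {}
    round_number = 0
    for round_number in range(1, max_rounds + 1):
        changed = 0
        for item in items:
            proposed = propose_label(item, corrections)
            final = expert_review(item, proposed)
            if final != proposed:
                corrections[item] = final
                changed += 1
            labels[item] = final
        if changed == 0:
            break  # experts accepted every proposal, so the labels have settled
    return labels, round_number


# Toy stand-ins, for illustration only.
def toy_llm(item, corrections):
    # Reuse an expert correction if one exists, otherwise guess a default label.
    return corrections.get(item, "medication")


def toy_expert(item, label):
    # The "expert" knows that mentions of an allergy are not medications.
    return "allergy" if "allergic" in item else label


items = ["takes penicillin daily", "allergic to penicillin"]
labels, rounds_used = run_labeling_rounds(items, toy_llm, toy_expert)
print(labels, rounds_used)
```

The stopping condition is the design choice worth noticing: the loop ends when a full pass produces no expert corrections, which is one simple way to decide that another round is no longer worth the expert's time.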
How do you trace AI agent errors back to training data?
Build the evaluation pipeline as a closed loop, not a one-way street. When a production failure surfaces, the system should identify which training examples shaped that failure, generate or label more relevant data, retrain, and measure improvement. Most teams skip this loop and treat data as a one-time setup cost — which is why their agents plateau.
What’s the difference between scarce data and bad data?
Scarce data means not enough examples exist to train the model on the target distribution. Bad data means the examples exist but contain errors, ambiguity, or coverage gaps that mislead training. Both produce similar symptoms (poor production performance) but require different fixes — scarcity needs collection, badness needs better labeling.
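A rough way to tell the two apart per slice is to look at how many examples exist and how often independent annotators disagree on them. The sketch below assumes each example carries a slice label and labels from more than one annotator. The `diagnose_slices` helper, its thresholds, and the data are hypothetical, and annotator disagreement is only one possible proxy for "bad" data.

```python
from collections import Counter, defaultdict


def diagnose_slices(examples, min_count=50, max_disagreement=0.2):
    """examples: list of (slice_label, [labels from independent annotators]).

    Returns {slice: "scarce" | "noisy" | "ok"}; scarcity is checked first.
    """
    counts = Counter()
    disagreements = defaultdict(int)
    for slice_label, annotator_labels in examples:
        counts[slice_label] += 1
        if len(set(annotator_labels)) > 1:
            disagreements[slice_label] += 1

    report = {}
    for slice_label, n in counts.items():
        rate = disagreements[slice_label] / n
        if n < min_count:
            report[slice_label] = "scarce"  # fix by collecting more examples
        elif rate > max_disagreement:
            report[slice_label] = "noisy"   # fix by relabeling or clearer guidelines
        else:
            report[slice_label] = "ok"
    return report


# Hypothetical examples; thresholds are lowered to suit the tiny sample.
examples = (
    [("allergy", ["allergy", "allergy"]), ("allergy", ["allergy", "medication"])]
    + [("medication", ["medication", "medication"])] * 2
    + [("medication", ["medication", "allergy"])] * 2
    + [("dosage", ["dosage", "dosage"])] * 4
)

report = diagnose_slices(examples, min_count=3, max_disagreement=0.2)
print(report)
```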
How long should it take to diagnose an AI agent data problem?
The diagnosis should be quick: within a week or two, run the issue tree across prompt engineering, more data, better data quality, and combinations of these. The fix can take longer, depending on the data work required. For high-stakes verticals where expert annotation is required, expect months of pipeline work to produce production-ready data.
Why do most AI agent systems fail in production despite working in demos?
Demos run on clean, predictable data. Production runs on messy, contextual, multilingual, edge-case-heavy data. The gap between demo data and production data is usually larger than teams expect, and the variables in production data are often not represented in the training set. Ahmed Rashad estimates a significant portion of the 95% production failure rate traces back to data coverage gaps.
Full episode coming soon
This conversation with Ahmed Rashad is on its way. Check out other episodes in the meantime.