Founder Insight

Why Most RAG Pipelines Fail in Production — And How to Actually Debug Them

John Berryman, Founder at Arcturus Labs

Listen on TL;Listen Prefer to listen? Hear this article read aloud.

The most common RAG failure mode has nothing to do with your embedding model, your chunk size, or your vector database. It is that you are treating the whole system as one thing when it is actually several things stitched together. And when something breaks, you have no way to tell which piece broke.

John Berryman, founder of Arcturus Labs and early engineer on GitHub Copilot, spent his first career in search relevance before moving into AI. He co-authored the O’Reilly book “Prompt Engineering for LLMs” and now builds production AI systems across industries. His view on the RAG landscape is blunt: the black-box mindset that dominated early adoption is the root cause of most production failures.

The Black Box Problem

Most teams set up RAG the same way. Vector database, chunking strategy, embedding model, a framework like LangChain or LlamaIndex, and then they hope. When it works, nobody knows exactly why. When it breaks, nobody knows where.

Berryman comes from a search engineering background, and the way most people approach RAG baffles him. “It’s always been a puzzle to me why people choose to seed over control and all the details I know are hard details about search and just leave that inside this black box called rag.”

The framework-first approach — pip install, turn it on — skips the part where you understand what each component is actually doing. That works for demos. It fails in production because production users ask questions your demo never anticipated.

The Decomposition Framework

Berryman’s approach strips RAG down to two components: an agent and a search engine. That is it.

“RAG is an agent and it’s search. And that’s all there is at this level.”

Break it further. The agent is a for loop around a large language model call with a growing context and some tools. The search tool handles retrieval. Now you can debug each piece independently.

The agent layer: Is the model interpreting the user’s information need correctly? Is the query it formulates well-formed? These are prompt engineering and context engineering problems with known solutions.

The search layer: When the query hits the search engine, are the results sensible? If not, you have a relevance tuning problem — not a “RAG is broken” problem.

Berryman’s practical debugging method: take the top 50 queries your users actually submit, trace each one through the pipeline by hand, and identify where the chain breaks. “I’m going to look at the conversation and try to figure out where it got it wrong. Is the agent itself interpreting the user’s information incorrectly? Well, I can fix that.”

One reason RAG feels fragile is that most implementations default to semantic search as the only retrieval mechanism. Berryman argues this is often the wrong choice — or at minimum, an incomplete one.

Semantic search is powerful for matching ideas. A query for “gorilla suit” can surface documents about “monkey costume.” But it cannot do exact-match filtering on structured fields like price ranges, color attributes, or specific names. And for non-generic domains, out-of-the-box embedding models produce mediocre results without fine-tuning.

Lexical search, the older technology that RAG enthusiasm largely pushed aside, handles arbitrary filters cleanly. Adding more filter fields does not slow it down. Exact-match lookups on names, product IDs, or regulatory lists work reliably.

“People got so enamored with semantic search that they kind of forgot all the good bits about lexical search.”

The practical answer, Berryman suggests, is to let the agent decide which tool to use. Tell it: if the user needs an exact string match, use this field. If they need an idea match, use vector search. Or use both.

When You Do Not Need an Index at All

Berryman has also been exploring what he calls “Roaming RAG” — retrieval systems where agents navigate collapsed table-of-contents structures instead of querying any search index. The agent reads section headings, infers relevance, and expands the sections it needs.

“That’s still a RAG application. It’s just no index. It’s just traversing a file system.”

For certain use cases — internal documentation, structured knowledge bases, content organized in clear hierarchies — this approach eliminates the entire indexing infrastructure. It is simpler to maintain and gives the agent contextual awareness that disembodied vector chunks lack.

FAQ

Why do most RAG pipelines fail when moving from demo to production?

Most failures stem from treating RAG as a monolithic black box rather than decomposing it into its constituent parts: an agent layer and a search layer. Demos work because test queries are predictable. Production users submit queries the demo never anticipated, and without component-level debugging, teams cannot isolate whether the agent, the query formation, or the search relevance is the failure point.

How do you debug a RAG pipeline that returns irrelevant results?

Take the top 50 most common user queries and trace each through the pipeline manually. Check three things in order: (1) Is the agent interpreting the user’s intent correctly? (2) Is the query the agent generates well-formed? (3) Are the search results relevant to that query? Each failure point has different fixes — prompt engineering for the agent, relevance tuning for search.

Should I use vector search or lexical search for my RAG system?

It depends on the information need. Vector search matches ideas to ideas — “gorilla suit” finds “monkey costume” — but cannot do exact-match filtering on fields like price, name, or category. Lexical search handles arbitrary filters without slowing down and provides reliable exact matching. Many production systems benefit from combining both and letting the agent choose which tool fits each query.

What is Roaming RAG and when should you use it?

Roaming RAG uses agents that navigate collapsed table-of-contents structures instead of querying a search index. The agent reads section headings, infers relevance, and expands sections as needed. It works well for structured knowledge bases and internal documentation where content has clear hierarchies, eliminating the need for vector database infrastructure entirely.

What is the best RAG framework to use in production?

There is no silver bullet framework. According to John Berryman, the best approach is to build RAG according to your specific user’s information needs rather than defaulting to a pre-built framework. Figure out whether users need exact match, idea matching, or structured filtering — then build the pipeline components that serve those needs and debug each piece independently.

How do you evaluate whether your RAG system is working correctly?

Decoupling the agent and search components enables targeted evaluation. For the agent: check whether it correctly interprets user intent and generates well-formed queries. For search: check whether results are relevant to those queries. Evaluating RAG as one atomic system makes it impossible to identify which component is underperforming.

What skills do AI engineers need to build production RAG systems?

Search engineering fundamentals matter more than framework expertise. Understanding query formation, relevance tuning, the trade-offs between lexical and semantic search, and how to decompose pipelines into debuggable components are the critical skills. Five years ago this required PhD-level ML knowledge; today the AI model is an API call, but the search architecture decisions still require human judgment.

How does Arcturus Labs approach RAG consulting projects?

Arcturus Labs builds production AI systems from first principles rather than applying a standard template. Each project starts with understanding the specific user’s information needs — what they search for, whether they need exact matches or conceptual similarity, what filters matter — and then constructs a pipeline of indexing, searching, interpretation, and context management tailored to those needs.

Full episode coming soon

This conversation with John Berryman is on its way. Check out other episodes in the meantime.

Visit the Channel

More from John Berryman

Related Insights