Why 80% of RAG Pipelines Fail in Production
Gil Feig, CTO at Merge
RAG looked like the answer to everything. Ground your LLM in real data, reduce hallucinations, give users accurate answers. Simple in theory. Brutal in practice.
Gil Feig has spent the last two years watching companies build RAG pipelines that work perfectly in demos and fall apart the moment real users touch them. The failure rate is staggering — and the reasons are almost always the same.
The Demo-to-Production Gap
In a demo, you control everything. Clean documents, predictable queries, a handful of test cases. In production, you get scanned PDFs with OCR errors, users who ask questions in ways you never imagined, and documents that were last updated three years ago.
“The gap between a RAG demo and a RAG product is about 10x the effort most teams estimate,” Gil says. “And 80% of that effort is in the boring parts — data cleaning, chunking strategy, retrieval tuning.”
The Three Failure Modes
Retrieval failure: The system finds the wrong chunks. This happens when your chunking strategy doesn’t match how users think about information. Fixed-size chunks are the default, but they’re almost always wrong for domain-specific content.
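One way to see the difference is to compare the naive default with a structure-aware alternative. The sketch below is illustrative, not a recommendation of specific libraries: the function names and the character budget are assumptions, and the structure-aware version simply keeps paragraphs intact instead of cutting mid-sentence.

```python
def fixed_size_chunks(text: str, size: int = 400) -> list[str]:
    # The common default: slice every `size` characters, ignoring structure.
    # A chunk can end mid-sentence, splitting an answer across two chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_size: int = 400) -> list[str]:
    # Structure-aware alternative: keep paragraphs whole, packing them
    # together until the next paragraph would exceed the budget.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

For domain-specific content, the unit of meaning is rarely a fixed byte count; splitting on the document's own boundaries (paragraphs, headings, clauses) is the usual first improvement.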
Synthesis failure: The right chunks are retrieved, but the LLM generates a wrong or misleading answer. This is especially common when retrieved chunks contain contradictory information from different time periods.
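A common mitigation is to carry source dates into the prompt so the model can weigh conflicting chunks. The sketch below is one possible approach, not a prescribed fix; the metadata fields and the newest-first ordering rule are assumptions.

```python
from datetime import date

# Hypothetical retrieved chunks carrying a last-updated date.
chunks = [
    {"text": "Plan A costs $10/month.", "updated": date(2021, 3, 1)},
    {"text": "Plan A costs $15/month.", "updated": date(2024, 6, 1)},
]


def order_by_recency(chunks: list[dict]) -> list[dict]:
    # Put the freshest evidence first so the prompt can instruct the
    # model to prefer newer sources when chunks contradict each other.
    return sorted(chunks, key=lambda c: c["updated"], reverse=True)


def build_context(chunks: list[dict]) -> str:
    # Annotate each chunk with its date; the LLM sees the conflict
    # explicitly instead of silently averaging two answers.
    return "\n".join(
        f"[updated {c['updated'].isoformat()}] {c['text']}"
        for c in order_by_recency(chunks)
    )
```

Exposing the dates does not guarantee the model resolves the conflict correctly, but it turns a silent failure into one the prompt can address.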
Freshness failure: The knowledge base is stale. Documents change, products update, policies evolve. Without a reliable refresh pipeline, your RAG system confidently serves yesterday’s answers to today’s questions.
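A minimal refresh pipeline only needs a way to detect change. One common technique, sketched here with assumed data shapes (doc-id to text mappings), is to store a content hash per indexed document and re-index anything whose hash no longer matches, while also flagging documents deleted at the source.

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of a document's current content.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(index_hashes: dict[str, str],
               sources: dict[str, str]) -> tuple[list[str], list[str]]:
    # `index_hashes`: doc_id -> hash recorded at last indexing time.
    # `sources`: doc_id -> current document text at the source.
    # Returns (docs to re-index, docs to delete from the index).
    to_reindex = [
        doc_id for doc_id, text in sources.items()
        if index_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in index_hashes if doc_id not in sources]
    return to_reindex, to_delete
```

Run on a schedule, this keeps re-indexing cost proportional to what actually changed rather than to the size of the corpus.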
What Production-Ready RAG Looks Like
The teams that get RAG right in production share a common approach: they treat it as a search engineering problem first and an AI problem second.
This means investing in evaluation before scaling, building feedback loops from day one, and accepting that the first version of your chunking strategy will be wrong.
FAQ
What’s the most common RAG failure in production?
Retrieval quality: the system finds the wrong chunks. Most teams underinvest in chunking strategy and retrieval tuning, focusing instead on the generation side.
How do you measure RAG quality?
Track retrieval precision and recall separately from generation quality. Use human evaluation on a random sample weekly, and build automated regression tests for known-good query-answer pairs.
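The retrieval side of that split can be measured with standard precision@k and recall@k against a labeled set of relevant chunks per query. A minimal sketch (the function name and inputs are illustrative):

```python
def precision_recall_at_k(retrieved: list[str],
                          relevant: set[str],
                          k: int) -> tuple[float, float]:
    # Precision@k: fraction of the top-k results that are relevant.
    # Recall@k: fraction of all relevant chunks found in the top-k.
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Tracked per query over time, these two numbers tell you whether a regression came from retrieval or from generation, which determines where to spend tuning effort.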
When should you not use RAG?
When your use case requires real-time data, when the source documents change faster than your indexing pipeline, or when the answers require complex multi-step reasoning across dozens of documents.
Watch the full conversation
Hear Gil Feig share the full story on Heroes Behind AI.
Watch on YouTube