
Why Most AI Teams Don't Actually Need RAG

Paul Iusztin, Founding AI Engineer & Author of LLM Engineer's Handbook at Decoding AI


Most teams building AI agents reach for RAG by default. The pattern is so reflexive that “do we need RAG?” rarely gets asked — engineers just start with retrieval and tune from there. Then latency creeps up, costs balloon, and the team blames the model.

Paul Iusztin, the founding AI engineer behind a vertical AI startup for financial advisors and the author of LLM Engineer’s Handbook (~20,000 copies sold), spent a year building exactly this kind of stack. RAG, MCP, agentic loops — the full set. He recently tore most of it out and wrote about why in “We Killed RAG, MCP, and Agentic Loops.” The article landed across the AI engineering community because it names something practitioners had been quietly noticing: the standard stack is overkill for most production use cases.

His argument isn’t that RAG is bad. It’s that RAG is misapplied — used by default in situations where the simpler approach (loading everything into the context window) outperforms it on cost, latency, and reliability.

The 64,000-token threshold

The reframe starts with a specific number. In Paul’s startup, every financial advisor’s full document corpus (every report, every client file, every memo) sums to about 64,000 tokens at maximum. That’s well within the 128K-token context windows that are now common.

“For most of our use cases, if we load all the data that an advisor has into the context window, it will just sum up to 64,000 tokens at maximum,” Paul explains. “It’s just easier just to load everything into the context window and just pass it to the model.”

The implication is uncomfortable for teams who’ve already built RAG infrastructure. If the data fits, retrieval adds work without adding value. The model gets the same information either way — but with retrieval, you’ve added a database, a chunking strategy, an embedding pipeline, and an extra round of failure modes.

Why retrieval hurts when data fits

Paul’s argument against retrieval-by-default isn’t just architectural elegance. It’s that retrieval, when it isn’t strictly necessary, actively degrades performance.

Here’s what happens in practice. The model retrieves chunks based on the user’s query. It tries to answer. If the right context isn’t in the chunks, the model has two options: hallucinate, or query memory again. A well-built agent does the second.

“The model, if it doesn’t find the answer in the data it pulled, it will start to query the memory again and again and again until it finds. And this is if you’re lucky enough and the agent realizes that it doesn’t have the right data and it doesn’t start to hallucinate,” Paul says. “So it’s basically this agentic loop, which is like a zigzag — it just adds extra latency and potentially even extra costs because you do many more LLM calls to answer your question.”

Each retry is another LLM call. Each call is more latency, more spend. For a single user request, you might trigger three or four retrievals before the model gives up and answers with what it has.
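The arithmetic behind that claim can be sketched in a few lines. This is a rough model, not a measurement: it assumes each retry re-sends every chunk retrieved so far (so the prompt grows with each round), and the chunk size and latencies below are hypothetical numbers chosen only for illustration. The function names are likewise invented for this sketch.

```python
def loop_input_tokens(tokens_per_retrieval: int, rounds: int) -> int:
    """Total input tokens across a retrieval loop.

    Assumes round k re-sends the chunks from rounds 1..k, so the
    prompt grows linearly and the running total grows quadratically.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return tokens_per_retrieval * rounds * (rounds + 1) // 2


def loop_latency_s(llm_latency_s: float, retrieval_latency_s: float, rounds: int) -> float:
    """Sequential latency: every round pays one retrieval plus one LLM call."""
    return rounds * (llm_latency_s + retrieval_latency_s)


# Illustrative numbers only; apart from the 64,000-token corpus size,
# none of these come from the article.
FULL_CONTEXT_TOKENS = 64_000  # corpus size quoted earlier
CHUNK_TOKENS = 8_000          # hypothetical tokens pulled per retrieval
LLM_LATENCY_S = 2.0           # hypothetical seconds per LLM call
RETRIEVAL_LATENCY_S = 0.3     # hypothetical seconds per retrieval

for rounds in (1, 2, 3, 4):
    tokens = loop_input_tokens(CHUNK_TOKENS, rounds)
    latency = loop_latency_s(LLM_LATENCY_S, RETRIEVAL_LATENCY_S, rounds)
    print(f"{rounds} round(s): {tokens:,} input tokens, {latency:.1f}s "
          f"(one full load sends {FULL_CONTEXT_TOKENS:,} tokens)")
```

Under these made-up numbers, the loop sends more input tokens than a single full load by the fourth round (80,000 versus 64,000). Real figures depend on your chunk sizes, model pricing, and how your agent manages its history, so measure before drawing conclusions.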

Angelina reframed this during the recording with an analogy that landed: imagine asking your intern to fetch a file from the cabinet, then double-check it. The intern wants to be helpful. So they go back and check again. And again. The recursive loop happens because the system cares about being right, but it’s expensive: a loop you would never tolerate from a human is exactly what your agentic system is doing.

When RAG actually makes sense

Paul isn’t anti-RAG. He’s anti-default. RAG still earns its place in the stack under specific conditions:

Medium-to-large data. If a single user’s relevant context exceeds the context window, retrieval becomes necessary. The threshold depends on the model (128K-token windows are now common, 1M-token windows are emerging), but the principle holds: the ratio of data size to context window determines whether you need retrieval at all.

Aggressively constrained context. If you’re running a smaller, cheaper model (cost optimization) and need to maximize what fits in a tight window, RAG lets you select the most relevant chunks rather than dumping everything in.

Mixed-source retrieval. When data lives across multiple systems (documents + databases + APIs) and the agent needs to dynamically choose what to pull, retrieval orchestration is genuinely the right pattern.

The decision rule Paul offers is simple enough to apply on the back of a napkin. Calculate two numbers: the size of your typical user’s relevant data, and the size of your context window. If the data fits in the window with room to spare, skip RAG. If the data exceeds the window, RAG is back on the table.
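The napkin rule translates into a short check. The sketch below is one possible reading of it, with assumptions: the article does not quantify "room to spare", so `reserved_tokens` and `headroom` are hypothetical parameters, and `estimate_tokens` uses a crude four-characters-per-token heuristic that you would replace with your model's actual tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumed heuristic); use a real tokenizer in practice."""
    return math.ceil(len(text) / chars_per_token)


def needs_retrieval(
    data_tokens: int,
    context_window: int,
    reserved_tokens: int = 8_000,  # assumed space for instructions and the answer
    headroom: float = 0.75,        # assumed share of the remaining window the data may use
) -> bool:
    """Return True when the data no longer fits in the window with room to spare."""
    usable = (context_window - reserved_tokens) * headroom
    return data_tokens > usable


# The 64,000-token corpus against three window sizes.
for window in (32_000, 128_000, 1_000_000):
    verdict = "consider RAG" if needs_retrieval(64_000, window) else "load everything"
    print(f"{window:>9,}-token window: {verdict}")
```

With these defaults, a 64,000-token corpus fits a 128K window (90,000 usable tokens) but not a 32K one. Tighten or loosen `headroom` to match how much space your prompts and outputs really need.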

What replaces RAG when data fits

The replacement isn’t dramatic. It’s the simplest possible architecture: load everything in, prompt the model, return the answer. Add metadata filters or simple SQL queries if you need to narrow data before loading. Use prompt engineering and context engineering to structure what the model sees.
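As a minimal sketch of that architecture, the example below narrows data with a plain SQL filter and then loads every matching document into one prompt. It is an illustration under assumptions, not Paul's implementation: the `documents` table, its columns, the `<document>` delimiters, and the sample rows are all hypothetical, and the actual model call is left out because it depends on your provider.

```python
import sqlite3


def build_context(conn: sqlite3.Connection, advisor_id: str, since: str) -> str:
    """Load every matching document for one advisor into a single text block."""
    rows = conn.execute(
        "SELECT title, body FROM documents "
        "WHERE advisor_id = ? AND created_at >= ? "
        "ORDER BY created_at DESC",
        (advisor_id, since),  # ISO dates compare correctly as strings
    ).fetchall()
    # Delimit each document so the model can tell them apart.
    return "\n\n".join(
        f'<document title="{title}">\n{body}\n</document>' for title, body in rows
    )


def build_prompt(context: str, question: str) -> str:
    """Wrap the loaded documents and the user question in one prompt."""
    return f"Answer using only the documents below.\n\n{context}\n\nQuestion: {question}"


# Tiny in-memory demo with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (advisor_id TEXT, title TEXT, body TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("adv-1", "Q1 memo", "Client prefers low-risk funds.", "2024-03-01"),
        ("adv-1", "Old report", "Archived notes.", "2019-01-15"),
        ("adv-2", "Other advisor", "Not relevant here.", "2024-02-01"),
    ],
)
prompt = build_prompt(
    build_context(conn, "adv-1", "2024-01-01"), "What does the client prefer?"
)
print(prompt)  # pass this string to whichever model client you use
```

The metadata filter does the narrowing that a retriever would otherwise do, but deterministically: there are no embeddings, no chunking strategy, and nothing to re-query if the first pass misses.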

Paul applied this same pattern in a deep-research agent and an article-writing agent — both capstone projects in his new Agentic AI Engineering course. “To write a professional article you often need as references five, six other articles. Your first thought is that yeah, if I want this agent that writes this article to have access to all my knowledge, I need RAG, like it’s obvious. But no, it was just a lot easier to put everything in there. With some smart prompt engineering, maybe some context engineering.”

The discipline Paul argues for is harder than it sounds. It’s the willingness to remove sophistication after you’ve already built it — to recognize that the fancy version you spent three months on is the wrong tool for the problem, and to swap it for the boring solution. “Usually the answer was more in simplicity than following all these fancy algorithms,” he says. “But it’s hard to find this simplicity. It’s the hardest.”

FAQ

When does RAG actually make sense for an AI agent?

RAG makes sense when relevant data per user exceeds the context window — typically medium-to-large datasets — or when running a smaller, cheaper model that requires aggressive context management. For data that fits in 64K-128K tokens, loading directly into the context window is usually faster and cheaper than retrieval-based approaches.

What is the 64,000-token threshold for skipping RAG?

In Paul Iusztin’s vertical AI agent for financial advisors, every advisor’s complete document corpus summed to at most roughly 64,000 tokens. At that size, loading all data into context outperformed RAG on cost, latency, and reliability. The number is specific to his data rather than a universal cutoff: the practical threshold scales with model context windows, where 128K is now common and 1M is emerging.

What is the agentic loop problem with RAG?

When RAG retrieval misses the right data, the agent re-queries memory in a recursive loop, similar to a diligent assistant double-checking a file cabinet. Each iteration adds an LLM call, increasing latency and cost. A single user request can trigger three or four retrievals before resolution. Loading the full context once is often faster.

How do I decide between RAG and context-loading for my AI agent?

Calculate two numbers: the size of a typical user’s relevant data, and your model’s context window. If the data fits in the window with room to spare, skip RAG and load directly. If the data exceeds the window, retrieval becomes necessary. The data-to-window ratio is the decision rule, not framework defaults or industry trends.

Does context-loading work for production AI agents?

Yes — Paul Iusztin uses this pattern in production for vertical AI agents in financial services and in two capstone projects in his Agentic AI Engineering course (deep research agent, article-writing agent). With prompt engineering and metadata filtering, context-loading replaces RAG cleanly when data fits in the window.

Why does RAG add latency to AI agents?

RAG adds latency through two paths: the retrieval itself (database query + ranking) and the recursive re-query loop when the model doesn’t find what it needs in the initial chunks. Each retry is another LLM call. Loading full context once costs more per call but often less in total time and money.

What’s the alternative to RAG when context-loading isn’t enough?

When data exceeds the context window but RAG feels like overkill, simple SQL queries with metadata filters and sorting often outperform vector retrieval. Filter your data by user ID, date range, or category before loading. Use keyword search rather than semantic retrieval for keyword-driven queries. The principle: match the technique to the data.

Should startups start with RAG or context-loading?

Start with context-loading. It’s simpler to build, faster to debug, and reveals what your actual data volume looks like. Add RAG only when measurement shows you’re hitting context window limits or running into specific cost constraints. Building the simple version first prevents the most common AI startup failure: over-engineering before product-market fit.

What are the signs your team has over-engineered RAG?

Signs include: average user data fitting comfortably in your context window, agentic loops re-querying memory more than once per user request, RAG infrastructure consuming more engineering time than your core product features, and the team being unable to articulate why RAG was chosen beyond “it’s the standard.” If any apply, audit whether retrieval is earning its complexity.
