Why Older AI Models Sometimes Beat Newer Ones in Production

Most teams upgrading to the latest frontier model assume accuracy will improve. In contract processing at scale, the opposite sometimes happens.

Deepak Bapat, CTO and co-founder of Tabs, the $91M-backed AI platform that automates contract-to-cash for finance teams at companies like Cursor and Statsig, discovered something that runs against the industry’s upgrade-first instinct. Some of Tabs’ earlier models do a better job processing contracts than newer ones, given the same context inputs. The biggest lever isn’t the model. It’s everything around it.

The Mid-80s Wall

When Tabs feeds a complex contract straight into a frontier LLM, with no context objects and no classification layer, extraction accuracy drops to the mid-80s percent range. That sounds respectable until you realize a single wrong field on an invoice can reset a payment cycle that takes months to complete.

“If you are a simple customer with a one page order form with a flat price that you’re charging annually upfront, yes, you can feed it to Claude,” Bapat explains. “But as it gets more complicated, you’re going to see that these frontier labs, the generic models without the right context are going to start to degrade in performance down into the mid 80s.”

The problem compounds with amendments, usage-based pricing, tiered commitments, and Fortune 500 paper requirements. Each adds a layer of ambiguity that raw model capability cannot resolve.

Context Objects as the Actual Moat

Tabs builds what Bapat calls “context objects” during customer onboarding — structured representations generated by feeding in all existing contracts and comparing them against patterns from 200+ other merchants. These objects carry the history, terminology, and business rules specific to each customer relationship.
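
As a rough illustration, a context object can be pictured as a small structured record that is serialized and passed to the model alongside each new contract. The sketch below is hypothetical: the class name, field names, and prompt layout are assumptions made for illustration, not Tabs’ actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContextObject:
    """Hypothetical shape of a context object (not Tabs' real schema)."""
    customer: str
    terminology: dict = field(default_factory=dict)     # customer term -> canonical meaning
    business_rules: list = field(default_factory=list)  # plain-language billing rules
    history: list = field(default_factory=list)         # notes on prior contracts and amendments

    def to_json(self) -> str:
        """Serialize the record so it can be embedded in a prompt."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def build_prompt(context: ContextObject, contract_text: str) -> str:
    """Place the serialized context ahead of the contract to be extracted."""
    return (
        "Customer context:\n" + context.to_json()
        + "\n\nContract to extract:\n" + contract_text
    )


# Illustrative values only.
example = ContextObject(
    customer="Acme Corp",
    terminology={"Platform Fee": "annual subscription charge"},
    business_rules=["Invoice annually, upfront"],
    history=["2023 order form", "2024 amendment adding a usage tier"],
)
print(build_prompt(example, "ORDER FORM: Platform Fee $12,000 per year"))
```

The point of the structure is that the same record can be reused for every contract from that customer, so the model does not have to rediscover the customer’s vocabulary each time.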

With context objects, extraction accuracy stays in the high 90s. Without them, even the best models plateau. And here’s the counterintuitive finding: “There are some earlier models that actually do a better job of being able to process contracts with the right context object than newer models with that same context object.”

Bapat suspects newer models may be overtrained on coding problems and overthink certain document interpretation tasks. But the mechanism matters less than the implication: if you can solve the problem with a simpler model and better context, you should.

The Cost Argument No One Makes

The industry conversation about model selection focuses almost entirely on capability. Bapat adds a dimension most teams ignore — fiduciary responsibility to customers.

“If you can figure out how to use a more simple model and use context to drive outcomes that are actually satisfactory, what you are doing is you’re doing the best thing for your customer, which is you are actually reducing costs for them while also being able to maintain a level of efficacy.”

He references an NVIDIA executive who recently pointed out that in some cases, hiring people is cheaper than paying for compute with frontier LLMs. As models grow larger and more expensive, teams that pin themselves to simpler architectures — “the Haikus, not the Opuses,” as Bapat puts it — may end up with both better accuracy and lower costs.
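
One way to make this selection rule concrete is to pick the cheapest model that still clears an accuracy target on your own evaluation set. The sketch below assumes you have already measured accuracy with context objects in place; the model names, accuracy figures, and costs are placeholders, not measurements from Tabs or any vendor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float           # measured on your own evaluation set, with context
    cost_per_contract: float  # same currency unit for every candidate


def cheapest_adequate(candidates: Sequence[Candidate],
                      target_accuracy: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate meeting the target, or None if none do."""
    adequate = [c for c in candidates if c.accuracy >= target_accuracy]
    if not adequate:
        return None
    return min(adequate, key=lambda c: c.cost_per_contract)


# Placeholder numbers for illustration only.
candidates = [
    Candidate("small-model", accuracy=0.97, cost_per_contract=0.01),
    Candidate("large-model", accuracy=0.96, cost_per_contract=0.10),
]
choice = cheapest_adequate(candidates, target_accuracy=0.95)
print(choice.name if choice else "no model meets the target")
```

If no candidate clears the target, the function returns None, which under the article’s argument is a signal to improve context before reaching for a larger model.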

What This Means for Production AI Teams

This isn’t an argument against model upgrades. It’s an argument for sequencing. Tabs’ contract processing pipeline doesn’t start with the LLM — it starts with classification (embeddings and cosine similarity), then context object construction, then structured output extraction, then deterministic computation for invoicing. The model is one step in a pipeline where most of the accuracy comes from context preparation.
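
The classification step can be sketched as a nearest-prototype lookup using cosine similarity. This is a minimal illustration under assumptions: in practice the vectors would come from an embedding model, whereas the three-dimensional vectors and contract-type labels below are made up, and Tabs’ actual classifier is not described in detail.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(embedding: Sequence[float],
             prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the label whose prototype vector is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine_similarity(embedding, prototypes[label]))


# Toy prototype vectors; real ones would be embeddings of known contract types.
prototypes = {
    "flat_annual": [1.0, 0.0, 0.0],
    "usage_based": [0.0, 1.0, 0.0],
    "amendment": [0.0, 0.0, 1.0],
}
print(classify([0.9, 0.1, 0.0], prototypes))
```

Classifying first lets the later steps choose which context and which extraction schema apply before the LLM is called.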

“I actually don’t think it’s just a model problem. I actually do think there is a context problem here,” Bapat says. “It is a context engineering problem at the end of the day that we need to solve for if we want to be good fiduciaries and good stewards of our customers.”

Teams spending engineering cycles on model migration might get more accuracy per dollar by investing those cycles in context engineering instead.

FAQ

Why does contract processing accuracy drop with generic AI models?

Generic frontier models without domain context hit mid-80s accuracy on complex contracts — those with amendments, usage-based pricing, or enterprise paper requirements. Each contract type introduces ambiguity that raw model capability cannot resolve. Adding structured context objects built from customer-specific patterns pushes accuracy into the high 90s.

How does Tabs achieve high accuracy on contract extraction?

During onboarding, Tabs generates “context objects” by ingesting all existing customer contracts and comparing patterns across 200+ merchants. These structured objects carry business rules, terminology, and relationship history. When passed alongside new contracts to the LLM, they keep extraction accuracy in the high 90s — better than manual human processing.

What is context engineering in production AI systems?

Context engineering is building the structured data pipeline around an AI model — classification layers, embeddings, historical patterns, domain-specific rules — so the model receives exactly the information needed to perform well. At Tabs, context objects built during onboarding matter more to accuracy than which frontier model is used.

When should AI teams upgrade models vs. improve context?

If accuracy plateaus despite using the latest model, the bottleneck is likely context, not capability. Tabs found that some earlier models with strong context objects outperform newer models given the same context. Teams should exhaust context improvements before investing in model migration, because those improvements may deliver more accuracy per engineering hour.

Why do newer AI models sometimes perform worse than older ones?

Bapat suspects newer models may be over-optimized for coding tasks and overthink document interpretation problems. With identical context objects, Tabs observed some earlier models extracting contract terms more accurately than newer ones. The mechanism is not confirmed, but the pattern suggests that general model capability and task-specific performance don’t always correlate.

How does Tabs reduce AI compute costs for customers?

By pinning to simpler, less expensive models — what Bapat calls “the Haikus, not the Opuses” — and achieving target accuracy through context engineering rather than model power. This approach reduces per-contract processing costs while maintaining high-90s accuracy, passing the savings to merchants running thousands of contracts monthly.

What is the contract-to-cash pipeline at Tabs?

Contracts enter the system and get classified via embeddings and cosine similarity. Context objects built from onboarding data are passed to an LLM for structured extraction. Deterministic algorithms then generate invoices and performance obligations. Anomaly detection via PCA flags unusual contracts for human review. The system auto-calibrates when merchants make corrections.
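
To show why the invoicing step is kept deterministic, here is a minimal sketch that turns already-extracted terms into an invoice schedule with exact decimal arithmetic. The `ExtractedTerms` fields and the equal-installment rule are assumptions chosen for illustration; the source does not describe Tabs’ actual invoicing logic.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class ExtractedTerms:
    """Hypothetical output of the LLM extraction step."""
    start: date
    total: Decimal
    installments: int    # number of equal invoices
    months_between: int  # billing interval in months


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of short months."""
    year, month_index = divmod(d.year * 12 + (d.month - 1) + months, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def invoice_schedule(terms: ExtractedTerms) -> list:
    """Return (due_date, amount) pairs whose amounts sum exactly to the total."""
    if terms.installments < 1:
        raise ValueError("installments must be at least 1")
    cent = Decimal("0.01")
    base = (terms.total / terms.installments).quantize(cent, rounding=ROUND_HALF_UP)
    amounts = [base] * (terms.installments - 1)
    amounts.append(terms.total - sum(amounts))  # last invoice absorbs rounding
    return [
        (add_months(terms.start, i * terms.months_between), amount)
        for i, amount in enumerate(amounts)
    ]


terms = ExtractedTerms(date(2025, 1, 31), Decimal("1000.00"), 3, 1)
for due, amount in invoice_schedule(terms):
    print(due.isoformat(), amount)
```

Because this step involves no model call, the same extracted terms always produce the same invoices, which makes errors traceable to the extraction step rather than the arithmetic.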

Does upgrading to GPT-5 or Claude Opus automatically improve AI accuracy?

Not necessarily. Tabs found that accuracy depends more on context quality than model generation. A paper Bapat references — “Even GPT-5.2 Can’t Count to Five” — illustrates that frontier models still struggle with basic tasks. Structured context objects and deterministic validation layers provide more reliable accuracy improvement than model swaps.

How do you measure AI accuracy in financial document processing?

Tabs uses a weighted accuracy equation across all extracted fields — contract terms, pricing, commitments, amendment details. A single wrong field can reset payment timelines, so the metric weights critical fields higher. This weighted approach reveals that overall model benchmark scores don’t predict field-level accuracy in domain-specific documents.
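
The source does not give the exact equation, but a weighted field-level accuracy can be sketched as the sum of weights for correctly extracted fields divided by the sum of all weights. The field names and weights below are hypothetical and only illustrate how a critical field can dominate the score.

```python
from typing import Dict


def weighted_accuracy(predicted: Dict[str, object],
                      expected: Dict[str, object],
                      weights: Dict[str, float]) -> float:
    """Share of total field weight that was extracted correctly (0.0 to 1.0)."""
    total = sum(weights[name] for name in expected)
    if total <= 0:
        raise ValueError("weights for the expected fields must sum to a positive value")
    correct = sum(
        weights[name] for name in expected if predicted.get(name) == expected[name]
    )
    return correct / total


# Hypothetical fields and weights: pricing fields count three times as much.
weights = {"price": 3.0, "billing_frequency": 3.0, "customer_name": 1.0, "po_number": 1.0}
expected = {"price": "12000.00", "billing_frequency": "annual",
            "customer_name": "Acme Corp", "po_number": "PO-1"}
predicted = dict(expected, price="1200.00")  # one critical field is wrong

print(weighted_accuracy(predicted, expected, weights))
```

In this example three of four fields are correct, which is 75% unweighted, but the weighted score is lower because the wrong field is one of the heavily weighted ones.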

