Expensive AI Models Overthink — When Cheaper Models Win

There is an assumption in AI development that mirrors consumer psychology: more expensive means better. If a state-of-the-art model costs ten times what an open-source alternative costs, it must produce ten times better output. Right?

Nicole Königstein ran the experiments and the results surprised her. Königstein is the CEO of Quantmate — an agentic quant research environment where trading strategies evolve from natural language prompts to live testing — and is currently authoring two O’Reilly definitive guides on transformers and AI agents. When she compared state-of-the-art models against smaller open-source models across multiple tasks, the expensive models didn’t just cost more. They actively made things worse on some tasks by overthinking.

“The more expensive ones, I’m not kidding you, did overthink literally their tasks and spend compared to a smaller open source model — at some points, in some cases, times as much tokens with not as good results,” Königstein says.

What overthinking looks like in practice

Königstein tested models from Anthropic, OpenAI, Qwen, Grok, and Gemini across 10 different tasks and datasets with varying difficulty. The pattern was consistent: for straightforward tasks, the models mostly agreed. For more complex multi-step tasks, disagreement grew. But the surprising finding wasn’t about accuracy — it was about reasoning efficiency.

State-of-the-art models with extended thinking capabilities would loop through their reasoning, revisiting the same considerations multiple times before producing output. The token meter kept climbing while the quality plateaued or declined. A smaller model would complete the same task in a fraction of the tokens because it didn’t have the capacity (or the tendency) to second-guess itself.

Königstein tested the extended thinking settings available through OpenRouter, where you can adjust thinking depth from medium to high to extreme. Her recommendation is unambiguous: “I would never recommend to use that extreme X high. I tested it and it just blows up in your face.”

The model selection problem nobody is solving

The counterintuitive finding creates a practical headache. If the best model depends on the task, and the wrong model for a given task can cost multiples of the right one with worse output, then model selection itself becomes a critical engineering decision.

Most teams default to one model across their entire agent system. A single model is the simplest choice, and the one picked is usually the most expensive state-of-the-art option, because nobody wants to be the person who chose the cheap model when something breaks. But Königstein’s research shows this is precisely backward.

“A lot of time the cheaper model got preferred because it was just doing fine for that task,” she says. “It’s not always the most expensive one needs to be over the top notch needs to be all the time.”

The problem is that evaluating which model works best for which subtask is itself expensive and time-consuming. It requires running the same tasks across multiple providers, comparing outputs, measuring token consumption, and checking for quality differences that may only appear in edge cases.
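The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not Königstein's actual harness: the `ModelFn` signature, the exact-match scorer, and the two stub models (with their token counts) are all assumptions made so the example runs offline. In practice each `ModelFn` would wrap a real provider client and return the token usage that provider reports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model call returns (output_text, tokens_used).
ModelFn = Callable[[str], Tuple[str, int]]


@dataclass
class RunResult:
    model: str
    task_id: str
    correct: bool
    tokens: int


def benchmark(models: Dict[str, ModelFn],
              tasks: List[Tuple[str, str, str]]) -> List[RunResult]:
    """Run every task on every model. Each task is (task_id, prompt, expected)."""
    results = []
    for name, fn in models.items():
        for task_id, prompt, expected in tasks:
            output, tokens = fn(prompt)
            # Exact match is a placeholder scorer; swap in a task-specific check.
            correct = output.strip() == expected
            results.append(RunResult(name, task_id, correct, tokens))
    return results


def summarize(results: List[RunResult]) -> Dict[str, Dict[str, float]]:
    """Aggregate accuracy and mean token use per model."""
    totals: Dict[str, Dict[str, int]] = {}
    for r in results:
        t = totals.setdefault(r.model, {"runs": 0, "correct": 0, "tokens": 0})
        t["runs"] += 1
        t["correct"] += int(r.correct)
        t["tokens"] += r.tokens
    return {
        model: {
            "accuracy": t["correct"] / t["runs"],
            "mean_tokens": t["tokens"] / t["runs"],
        }
        for model, t in totals.items()
    }


# Stub models with made-up token counts, for illustration only (not measurements).
def small_stub(prompt: str) -> Tuple[str, int]:
    return ("4", 12)


def large_stub(prompt: str) -> Tuple[str, int]:
    return ("4", 90)


report = summarize(
    benchmark({"small": small_stub, "large": large_stub},
              [("arith-1", "What is 2 + 2?", "4")])
)
print(report)
```

Reporting accuracy and mean tokens side by side is the point: a model that matches on accuracy but uses several times the tokens shows up immediately, and per-task results can be inspected for the edge cases where quality differs.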

Automated model selection as infrastructure

This problem is what led Königstein to build Agenens Flow — a tool she plans to open source that automates model selection. The name comes from Latin, meaning acting and conducting, and the system learns over time which models perform best for specific task types while minimizing token consumption.

“It actually uses lesser tokens over time because it learns how to coordinate and choose cheaper models and better models for the task,” Königstein explains. The tool sits between your agent system and the model providers, routing each task to the model that has historically performed best at the lowest cost.

The implication for teams building multi-agent systems is significant. Instead of one expensive model handling every subtask — from simple classification to complex multi-step reasoning — the system learns to assign cheap models to cheap tasks and reserves expensive models for the work that actually requires their capability.
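To make the routing idea concrete, here is a minimal sketch of a router that learns from history. It is not how Agenens Flow is implemented (the source does not describe its internals); the class name `CostAwareRouter`, the success-rate threshold, and the "try every model a few times first" rule are assumptions chosen for illustration. Cost is approximated by mean tokens per call, which ignores differences in per-token price.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CostAwareRouter:
    """Hypothetical router: the cheapest model that has been good enough wins."""

    def __init__(self, models: List[str], min_success: float = 0.9,
                 min_trials: int = 3) -> None:
        self.models = list(models)
        self.min_success = min_success
        self.min_trials = max(1, min_trials)  # at least one trial before judging
        # stats[(task_type, model)] = [trials, successes, total_tokens]
        self.stats: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0, 0])

    def record(self, task_type: str, model: str, success: bool, tokens: int) -> None:
        """Store the outcome of one completed call."""
        s = self.stats[(task_type, model)]
        s[0] += 1
        s[1] += int(success)
        s[2] += tokens

    def choose(self, task_type: str) -> str:
        """Pick a model for the next call of this task type."""
        # Explore: a model without enough history for this task type goes first.
        for m in self.models:
            if self.stats[(task_type, m)][0] < self.min_trials:
                return m

        def success_rate(m: str) -> float:
            trials, successes, _ = self.stats[(task_type, m)]
            return successes / trials

        def mean_tokens(m: str) -> float:
            trials, _, tokens = self.stats[(task_type, m)]
            return tokens / trials

        # Exploit: among models that clear the quality bar, take the cheapest.
        good_enough = [m for m in self.models if success_rate(m) >= self.min_success]
        if good_enough:
            return min(good_enough, key=mean_tokens)
        # Nothing clears the bar: fall back to the most reliable model.
        return max(self.models, key=success_rate)


router = CostAwareRouter(["small-model", "large-model"], min_trials=2)
for _ in range(2):
    router.record("classification", "small-model", success=True, tokens=40)
    router.record("classification", "large-model", success=True, tokens=600)
print(router.choose("classification"))
```

With the made-up history above, both models succeed on classification, so the router settles on the one that used fewer tokens. A production version would also need to decide how "success" is judged and how to keep re-testing models as they change.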

What this means for production agent systems

The practical takeaway is straightforward: benchmark before you commit. Run your specific tasks across multiple models and measure both quality and token consumption. Königstein’s experience suggests you will find at least some subtasks where a cheaper model produces equivalent or better output.

For teams already in production with a single expensive model, the optimization opportunity is real. Swapping even a few subtasks to cheaper models can reduce costs significantly — and in some cases, improve output quality by eliminating the overthinking problem entirely.
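A back-of-the-envelope calculation shows how swapping a single subtask can move the total. Every figure below (call volumes, tokens per call, prices) is hypothetical and chosen only to make the arithmetic easy to follow; substitute your own measured token counts and your provider's actual pricing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    name: str
    calls_per_day: int
    tokens_per_call: int           # mean tokens per call, including reasoning tokens
    usd_per_million_tokens: float  # blended price; hypothetical values below


def daily_cost(subtasks: List[Subtask]) -> float:
    """Total daily spend in USD across all subtasks."""
    return sum(
        s.calls_per_day * s.tokens_per_call * s.usd_per_million_tokens / 1_000_000
        for s in subtasks
    )


# Before: one expensive model handles both subtasks (hypothetical numbers).
before = [
    Subtask("classification", 10_000, 2_000, 10.0),
    Subtask("multi-step planning", 500, 8_000, 10.0),
]

# After: only classification moves to a cheaper model that also uses fewer tokens.
after = [
    Subtask("classification", 10_000, 500, 1.0),
    Subtask("multi-step planning", 500, 8_000, 10.0),
]

print(f"before: ${daily_cost(before):.2f}/day")
print(f"after:  ${daily_cost(after):.2f}/day")
```

The saving compounds because two factors change at once on the swapped subtask: the price per token and the number of tokens spent per call.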

FAQ

Do expensive AI models always produce better results than cheaper ones?

No. Nicole Königstein’s research comparing models from Anthropic, OpenAI, Qwen, Grok, and Gemini found that state-of-the-art models sometimes overthink tasks, consuming multiples of the tokens a smaller open-source model uses with no quality improvement — and sometimes worse results. The best model depends on the specific task.

What does it mean when an AI model “overthinks” a task?

The model loops through extended reasoning chains, revisiting the same considerations multiple times before producing output. Königstein observed this with state-of-the-art models on straightforward tasks — the model burned through tokens second-guessing itself while a simpler model completed the task directly. Extended thinking settings amplify this problem.

How do I choose the right AI model for each task in my agent system?

Run your specific tasks across multiple models and compare both output quality and token consumption. Königstein recommends not defaulting to one model for everything. Simple subtasks often perform equally well with cheaper models, while complex multi-step reasoning may benefit from more capable ones. Match the model to the task complexity.

What is Agenens Flow and how does it help with model selection?

Agenens Flow is a tool Nicole Königstein is developing, and plans to open source, that automates model selection for multi-agent systems. It learns over time which models perform best for specific task types at the lowest token cost, routing each subtask to the best-performing model instead of sending everything to a single expensive provider.

Should I use extended thinking or reasoning modes in AI models?

Use them selectively. Königstein tested thinking depth settings from medium to extreme and found that the highest settings massively increase token consumption without proportional quality gains. Her recommendation: never use the extreme thinking settings, and only apply extended reasoning to tasks where the model genuinely benefits from deeper consideration.

How much can I save by using cheaper AI models for some tasks?

The savings depend on your task mix, but Königstein’s experiments showed expensive models consuming multiples of the tokens used by cheaper alternatives on some tasks. For a multi-agent system with many subtasks, swapping even a few to cheaper models can reduce total token costs significantly while maintaining or improving output quality.

Why do AI teams default to the most expensive model?

Risk aversion. Nobody wants to be responsible for choosing the cheap model when something fails. But Königstein’s research shows this default is often the most expensive mistake — both in direct token costs and in the overthinking behavior that degrades output quality. The safer choice is benchmarking per task and assigning models based on evidence.

What is the difference between model quality and model efficiency?

Quality measures whether the output is correct and useful. Efficiency measures the token cost to produce that output. Königstein found these metrics are not always correlated — an expensive model can produce lower-quality output at higher cost if it overthinks the task. The optimal model maximizes quality per token, not raw capability.

How do different AI models compare as judges in evaluation tasks?

Königstein tested multiple model families as judges and found broad agreement on easy tasks but significant disagreement on harder multi-step evaluations. Different models have different biases, and changing the underlying model can shift judge decisions. She recommends using multiple model families for evaluation and frameworks like RULER for relative comparison.
