Founder Insight

Why LLMs Are a Search Problem, Not a Generative Problem

Mike Taylor, CEO at Ask Rally


Most developers building with large language models treat them as creative engines. You describe what you want, the model generates it, and you hope the output is good enough. Mike Taylor thinks that framing is why so many AI products break in production.

Taylor is the CEO of Ask Rally — an AI focus group platform that creates synthetic personas calibrated against real human interview data — and the author of O’Reilly’s prompt engineering textbook. He spent four years working with GPT-3 and its successors before most engineers had API access. His conclusion after building production systems across dozens of clients: the mental model most people use for LLMs is wrong.

The search reframe

“It actually really helps to look at an LLM as a search problem rather than a generative problem because they don’t actually generalize very well outside of the data set,” Taylor explains. “It’s just the data set is very large, right? Like the data set is the whole internet.”

The implication is practical. If a task hasn’t appeared frequently in training data, the model is far more likely to hallucinate. Your job as a builder isn’t to write better prompts that inspire creativity; it’s to make your task more common for the model. You do that by injecting relevant examples into the prompt or by fine-tuning the weights.
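As a rough illustration of "injecting relevant examples into the prompt", the sketch below assembles a few-shot prompt from input/output pairs. The function name, the prompt layout, and the support-ticket data are all hypothetical, chosen only to show the idea; they are not taken from Taylor's own tooling.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str, examples: List[Tuple[str, str]], query: str
) -> str:
    """Assemble a prompt that shows worked examples before the new input."""
    parts = [instruction.strip(), ""]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The new input goes last, with the output left open for the model.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical task and data, for illustration only.
EXAMPLES = [
    ("Order #1 arrived broken", "complaint"),
    ("How do I reset my password?", "question"),
]
prompt = build_few_shot_prompt(
    "Classify the support ticket.", EXAMPLES, "My invoice is wrong"
)
print(prompt)
```

The resulting string would be sent to whichever model API you use; the point is only that the examples, not a longer description, carry the task definition.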

This maps directly to how Taylor approaches client work. When a company hires him to improve an AI workflow, the first question is always about data: do you have enough good examples of the task you’re trying to solve? Not descriptions of what you want. Actual examples.

The 10-to-200 examples ladder

Taylor lays out a concrete progression that any team can follow. Under 10 examples, just add them directly to the prompt; no optimization framework is needed. Over 10, use an automated optimizer like DSPy’s GEPA optimizer, which costs under a dollar per run. For mission-critical production systems, aim for 200 examples, collected gradually from user logs.
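The ladder can be written down as a small decision function. This is a minimal sketch: the strategy labels are made up for illustration, and treating exactly 10 examples as optimizer territory is an assumption, since the article only says "under 10" and "over 10".

```python
def choose_strategy(n_examples: int) -> str:
    """Map the number of collected examples to a rung on the ladder."""
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if n_examples < 10:
        # Few examples: paste them straight into the prompt.
        return "inline_few_shot"
    if n_examples < 200:
        # Assumption: exactly 10 examples already justifies an optimizer run.
        return "automated_optimizer"
    # 200 or more: the target the article gives for mission-critical systems.
    return "production_optimizer"


for count in (3, 10, 50, 200):
    print(count, choose_strategy(count))
```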

“I don’t think it’s a one and done process,” he says. “I think you can actually just have it running a regular optimization every week. It just captures the user preferences slightly better this time.”

The gains are significant. With just 10 examples, Taylor has seen accuracy improvements of 50-60%, depending on task complexity. And the investment scales: you start by vibing (“do I think the results look good?”), graduate to team review, then harvest user interaction logs to feed the optimizer continuously.

Why examples beat descriptions

There’s a tension here that trips up even experienced ML engineers. If you can describe what you want, why bother collecting examples? Taylor’s answer: because humans are bad at articulating the nuances of their preferences, and AI is surprisingly good at inferring them from patterns.

“If I give you 100 pictures of cats, 100 pictures of dogs, and you run that through DSPy or whatever, it’s going to figure out how to classify pretty easily,” he says. But the real value shows up on harder tasks — blog writing, ad copy, persona creation — where the difference between good and bad is subjective. Collect a swipe file of outputs the client loves and outputs they hate, and let the model find the patterns humans struggle to name.
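One way to put a swipe file to work is to show the model both piles and ask it to name the differences. The sketch below assumes a hypothetical `SwipeEntry` record and prompt wording; it only builds the text and leaves the model call to whatever client you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SwipeEntry:
    text: str
    liked: bool  # True if the client loved it, False if they hated it


def build_preference_prompt(entries: List[SwipeEntry]) -> str:
    """Lay out liked and disliked outputs so a model can contrast them."""
    liked = [e.text for e in entries if e.liked]
    disliked = [e.text for e in entries if not e.liked]
    if not liked or not disliked:
        raise ValueError("Need at least one liked and one disliked example")
    lines = ["Below are outputs the client loved and outputs they rejected.", ""]
    lines.append("LOVED:")
    lines.extend(f"- {text}" for text in liked)
    lines.append("")
    lines.append("REJECTED:")
    lines.extend(f"- {text}" for text in disliked)
    lines.append("")
    lines.append(
        "Describe the patterns that separate the loved outputs from the "
        "rejected ones as a short list of writing guidelines."
    )
    return "\n".join(lines)


# Hypothetical ad-copy swipe file, for illustration only.
swipe_file = [
    SwipeEntry("Ship faster without the late nights.", liked=True),
    SwipeEntry("Our synergistic platform leverages AI.", liked=False),
]
print(build_preference_prompt(swipe_file))
```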

What this means for production AI

The search framing solves a specific class of production failures: the ones where the model works in demos but breaks on edge cases. If you know the model is searching, not creating, you can diagnose failures by asking whether the task was in-distribution. If it wasn’t, you add examples to make it so. The debugging process becomes empirical rather than vibes-based.
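To make that diagnosis empirical, you can tag evaluation examples by how common you believe the task is and compare accuracy per tag. In this sketch the slice names, the data, and `fake_predict` (a keyword rule standing in for a real model call) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (slice_name, input_text, expected_output)
LabeledExample = Tuple[str, str, str]


def accuracy_by_slice(
    examples: List[LabeledExample], predict: Callable[[str], str]
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each slice."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for slice_name, text, expected in examples:
        total[slice_name] += 1
        if predict(text) == expected:
            correct[slice_name] += 1
    return {name: correct[name] / total[name] for name in total}


def fake_predict(text: str) -> str:
    # Stand-in for a model call: a keyword rule, for illustration only.
    return "refund" if "refund" in text.lower() else "other"


EVAL_SET: List[LabeledExample] = [
    ("common", "I want a refund", "refund"),
    ("common", "Refund my order please", "refund"),
    ("niche", "Reverse the chargeback on SKU 12", "refund"),
    ("niche", "Where is your office?", "other"),
]
report = accuracy_by_slice(EVAL_SET, fake_predict)
print(report)
```

A slice that scores well below the others is a candidate for more examples before anyone rewrites the instructions.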

Taylor has rarely needed to reach for fine-tuning. Prompt optimization with sufficient examples handles most production use cases. The exception is when you need to move off frontier models to open-source models for cost reasons, and even then, DSPy’s optimization can bridge that gap for under a dollar per run.

FAQ

What does it mean to treat LLMs as a search problem rather than a generative one?

LLMs retrieve and recombine patterns from training data rather than creating genuinely new content. When a task falls outside what appeared in training data, accuracy drops sharply. Treating AI outputs as search results — constrained by what the model has seen — leads to better engineering decisions about when to add examples versus when to fine-tune.

How many examples do you need for production AI prompt optimization?

Under 10 examples: add them directly to the prompt. Over 10: use an automated optimizer. For mission-critical production: aim for 200 examples. The cost of running an optimization with DSPy’s GEPA optimizer is under a dollar for most tasks, making it accessible even for small teams and side projects.

Why do good examples matter more than good prompt descriptions?

Humans struggle to articulate what makes one output better than another, especially for subjective tasks like writing or design. AI pattern recognition excels at finding the differences between a set of liked versus disliked examples. A swipe file of examples often produces better prompts than carefully written instructions alone.

What is DSPy and how does it optimize prompts?

DSPy is an open-source framework for building LLM programs and optimizing their prompts automatically. Optimizers such as GEPA use evolutionary search and mini-batch testing: they test prompts on holdout examples to avoid overfitting, report accuracy on data they haven’t learned from, and produce plain-English prompts that capture patterns from your example set, all for under a dollar per run.
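The holdout idea is independent of any framework. The sketch below is a generic split written with the standard library, not DSPy's internal implementation; the function name, the 20% default, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(
    examples: Sequence[T], holdout_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle examples and reserve a fraction the optimizer never sees."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    if len(examples) < 2:
        raise ValueError("Need at least two examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    n_holdout = min(n_holdout, len(shuffled) - 1)  # keep at least one to train on
    return shuffled[n_holdout:], shuffled[:n_holdout]


train, holdout = split_holdout(list(range(10)))
print(len(train), len(holdout))
```

Accuracy reported on `holdout` is the number to trust, since a prompt tuned on `train` can look better there than it will on new inputs.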

How do AI startups reduce costs when moving from frontier to open-source models?

The primary cost driver is paying by token on frontier models like GPT-4 or Claude Opus, which cost 10-30x more than open-source alternatives. Prompt optimization through tools like DSPy distills knowledge into shorter, more efficient prompts that perform well on cheaper models. Most production workflows don’t require fine-tuning if examples are sufficient.
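To see what a 10-30x price gap means for a monthly bill, here is a back-of-the-envelope calculation. The token volume and the frontier price per million tokens are invented numbers, not quotes from any provider; only the 10x and 30x ratios come from the answer above.

```python
def monthly_cost_usd(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Cost of a month's traffic at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# Hypothetical figures, for illustration only.
TOKENS_PER_MONTH = 200_000_000
FRONTIER_USD_PER_MILLION = 15.0

frontier = monthly_cost_usd(TOKENS_PER_MONTH, FRONTIER_USD_PER_MILLION)
# Open-source price at the two ends of the 10-30x range.
open_at_10x = monthly_cost_usd(TOKENS_PER_MONTH, FRONTIER_USD_PER_MILLION / 10)
open_at_30x = monthly_cost_usd(TOKENS_PER_MONTH, FRONTIER_USD_PER_MILLION / 30)

print(f"frontier: ${frontier:,.2f}")
print(f"open-source: ${open_at_30x:,.2f} to ${open_at_10x:,.2f}")
```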

What accuracy gains can you expect from prompt optimization?

With 10 examples, accuracy improvements of 50-60% are common depending on task complexity and baseline. DSPy’s GEPA optimizer has doubled accuracy from a low baseline in some cases. Gains diminish if an LLM or a prompt engineer already wrote a strong initial prompt, but optimization saves days of manual tuning for comparable results.

How do you know if an AI task is in-distribution for the model?

Ask whether the task appears frequently on the internet or in the model’s training data. Common formats like email writing or code completion are in-distribution. Niche tasks specific to your business likely are not. When accuracy drops, the first diagnosis should be whether you need to add examples that make the task more common for the model.

What is the difference between prompt engineering and prompting?

Prompting is casual AI interaction: describing what you want in natural language. Prompt engineering involves systematic evaluation, A/B testing, workflow design, and cost optimization at scale. Some people who are great at casual prompting struggle with engineering, and vice versa. The engineering component treats prompts as production systems with reliability requirements.

Can prompt optimization replace fine-tuning for production AI?

In most cases, yes. Prompt optimization with sufficient examples handles the majority of production use cases without fine-tuning. Fine-tuning becomes necessary only for extreme cost reduction (moving to very small open-source models) or when the task is so specialized that no amount of in-context examples can cover the distribution. DSPy supports both approaches.
