Why Overengineering Your Data Stack Kills AI Projects Before They Start

Most data teams follow the same playbook when they decide to add AI to their analytics stack. First, consolidate all the data. Then build gold tables. Then implement a semantic layer with definitions for every column. Then set up a vector database. Then build a RAG pipeline. Then, finally, let a user touch it.

Mark Hay, CTO and co-founder of TextQL — an agentic data analytics platform that raised $17 million to let enterprises query messy data in plain English — has watched this pattern play out dozens of times. And he thinks it’s backwards.

“The biggest thing I see that I disagree with is people coming at it from the exact opposite direction,” Hay says. Not the teams doing too little with AI. The teams doing too much before they even start.

The Overbuilding Checklist That Kills Projects

Hay describes a specific sequence he sees repeatedly. A data team decides to bring AI into their analytics workflow, and before a single business user sees a result, the engineering roadmap looks like this: ETL pipelines to unify every data source, transformation layers to create pristine tables, a BI tool on top, a semantic layer with column-level definitions, and a vector database feeding a RAG pipeline.

“I think you can see where I’m going with this,” Hay says, listing the layers. Each one is defensible on its own. Together, they create a six-month infrastructure project that delays the thing the business actually wants: answers to its questions.

The cost isn’t just time. It’s organizational momentum. By the time the stack is “ready,” the executive who championed the project has moved on, the budget window has closed, or the team has burned out on infrastructure before shipping anything a user can touch.

The Counterintuitive Alternative

When presented with a naive approach (dump your database schema into Claude, let it write SQL, run the query, return the results), Hay reacted in a way that surprised even the interviewer, who had expected him to tear it apart.

“You might be expecting I have a ton of criticism. But that is a very good start.”
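The naive loop is small enough to sketch in full. The version below is illustrative only, not TextQL's implementation: it uses an in-memory SQLite database, and `generate_sql` is a hypothetical stand-in for whatever model call you would make (the interview mentions Claude; no real API is called here). All function names are assumptions made for the example.

```python
import sqlite3
from typing import Callable


def dump_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements for every table as plain text."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def build_prompt(schema: str, question: str) -> str:
    """Put the raw schema and the user's question into one prompt."""
    return (
        "You are given this database schema:\n\n"
        f"{schema}\n\n"
        f"Write one SQLite SELECT statement that answers: {question}\n"
        "Return only the SQL."
    )


def answer(conn: sqlite3.Connection, question: str,
           generate_sql: Callable[[str], str]):
    """Schema -> prompt -> SQL -> rows. generate_sql is the model call."""
    sql = generate_sql(build_prompt(dump_schema(conn), question))
    return sql, conn.execute(sql).fetchall()


def fake_model(prompt: str) -> str:
    # Hypothetical stub so the sketch runs offline; a real model call goes here.
    return ("SELECT region, SUM(amount) FROM orders "
            "GROUP BY region ORDER BY region")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

sql, rows = answer(conn, "What is total revenue by region?", fake_model)
print(rows)  # [('east', 12.5), ('west', 5.0)]
```

Nothing here is production-ready: the generated SQL runs unchecked, which is the first thing to fix.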

His recommended sequence inverts the typical one: start with the simplest approach that delivers value. Then layer on capabilities based on what users actually need. Better sandboxing. Semantic modeling for the queries that keep failing. Observability for the queries that matter most.

“Start with something. And then on top of that, layer on capabilities.”

This isn’t about being sloppy. Security is the one exception Hay carves out — “you should definitely be going for the most secure thing possible from day one.” But for the AI architecture itself, simple and iterative beats comprehensive and delayed.
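One way to picture "secure from day one" at the query-execution step is to refuse anything that is not a read before it runs. The sketch below is an assumption-laden illustration, not something Hay describes: it relies on SQLite's authorizer hook (exposed in Python as `Connection.set_authorizer`), and the helper names `read_only_authorizer` and `run_guarded` are invented for the example. A production warehouse would use its own roles and permissions instead.

```python
import sqlite3

# Only plain reads are allowed: SELECT statements, column reads, SQL functions.
ALLOWED_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Called by SQLite while compiling a statement; deny anything not a read."""
    if action in ALLOWED_ACTIONS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def run_guarded(conn: sqlite3.Connection, sql: str, max_rows: int = 1000):
    """Run model-written SQL; return (rows, error) and cap the result size."""
    try:
        return conn.execute(sql).fetchmany(max_rows), None
    except sqlite3.DatabaseError as exc:
        return [], str(exc)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)
conn.commit()

# Install the guard only after the setup writes are committed.
conn.set_authorizer(read_only_authorizer)

ok_rows, ok_err = run_guarded(
    conn,
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region",
)
bad_rows, bad_err = run_guarded(conn, "DELETE FROM orders")

print(ok_rows)              # the aggregate query still works
print(bad_err is not None)  # the DELETE is rejected before it executes
```

The point of the design is that the guard sits in the database layer rather than in prompt instructions, so it holds even when the model writes something unexpected.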

Why This Pattern Persists

The overbuilding impulse isn’t irrational. It comes from a reasonable fear: if the AI produces wrong answers, the whole initiative loses credibility. So teams try to eliminate error before launch by perfecting every input.

Hay argues the math works the other way. Modern language models are surprisingly good at handling messy, imperfect data. They can explore information schemas, join across sources, and discover patterns without requiring gold-standard preparation upfront. The risk of shipping something imperfect and iterating is lower than the risk of never shipping at all.

TextQL’s own pitch to customers reflects this: “Your data is a disaster, but we can work with it.” Put in what you have, then figure out how to make it better step by step.

FAQ

Why do most AI data projects fail before launching?

In Hay’s view, AI data projects stall because teams spend months building perfect infrastructure (ETL pipelines, semantic layers, vector databases, RAG systems) before any business user sees a result. The overengineering creates delays that exhaust budgets and organizational patience. Starting simple and iterating delivers value faster with lower risk.

What is the simplest way to add AI to enterprise data analytics?

The simplest approach: take your database schema, put it in an LLM’s context window, let the model write SQL queries, execute them, and return results. This works surprisingly well for small teams. Layer on security, semantic modeling, and observability based on what users actually need — not what might theoretically break.

How does TextQL handle messy enterprise data?

TextQL connects to multiple data sources — Snowflake, BigQuery, SAP, and others — and applies AI to model imperfect data without requiring upfront consolidation. Their approach is incremental: start with what the customer has, build an ontology layer over time as patterns emerge, and improve accuracy through usage rather than pre-launch preparation.

What should you build first when adding AI to your data stack?

Start with the simplest approach that returns value to users — even if it’s just a schema dump into an LLM writing SQL. Then add capabilities iteratively: better sandboxes for query execution, semantic definitions for columns that cause errors, observability for high-stakes queries. Security is the one exception — build that from day one.
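To know which columns or tables deserve semantic definitions first, you need some record of what fails. A minimal sketch of that bookkeeping follows. The `QueryLog` class and its fields are hypothetical, and how a query gets marked as failed (an execution error, a user thumbs-down) is left as an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class QueryLog:
    """Counts failed questions per table so the team can see where to invest."""
    failures: Counter = field(default_factory=Counter)
    total: int = 0

    def record(self, question: str, tables: list[str], ok: bool) -> None:
        # 'ok' would come from an execution error or user feedback (assumed).
        self.total += 1
        if not ok:
            self.failures.update(tables)

    def worst_tables(self, n: int = 3) -> list[tuple[str, int]]:
        """Tables most often involved in failed questions, worst first."""
        return self.failures.most_common(n)


log = QueryLog()
log.record("revenue by region", ["orders"], ok=True)
log.record("churn by plan", ["subscriptions", "accounts"], ok=False)
log.record("active accounts", ["accounts"], ok=False)

print(log.worst_tables(1))  # [('accounts', 2)]
```

In this toy run, `accounts` is where a column-level definition would pay off first.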

How long does it take to deploy an AI data analytics solution?

With a simple approach (schema to LLM), a small team can have a working prototype in hours. Enterprise-grade platforms like TextQL can run pilots in 4-8 hours by connecting to existing data sources without requiring data migration. Full deployment timelines depend on the number of sources and security requirements.

Why does overengineering data infrastructure hurt AI adoption?

Overengineering creates a paradox: the more infrastructure you build before launch, the longer users wait, and the harder it becomes to justify the investment. Six-month infrastructure projects often outlive their executive sponsors. Meanwhile, simpler approaches would have delivered value in weeks and generated the usage data needed to know what to build next.

Is it safe to use AI on enterprise data without a semantic layer?

Modern LLMs can work from raw database schemas and often produce accurate SQL queries without a pre-built semantic layer. Accuracy improves as you add semantic definitions for frequently queried columns, but starting without one is viable. The key safeguard is security: sandboxed query execution and access controls matter more than perfect data definitions at launch.

What is the biggest mistake when building an AI data pipeline?

According to Mark Hay, the biggest mistake is “erring on the side of overbuilding the solution” — trying to make the system airtight before any user touches it. This applies to AI architecture specifically, not security (which should be robust from day one). The most successful teams start simple and add complexity based on real user needs.
