
The five AI projects that almost always fail, and why

After shipping AI products for clients across four continents, I keep seeing the same five projects fail in the same five ways. Each of them is fixable, but only if you know the failure mode before you start. This is what to look for, and what to do instead.

Failure 1. The chatbot for everything

The pitch sounds reasonable. We will put a chatbot on our SaaS, customers will ask it anything, it will answer. Six months later the team has burned 80,000 euros, the chatbot gets 40 percent of questions wrong, and the support team refuses to use it because it makes their job harder.

The structural problem is scope. A chatbot for everything has no acceptance criteria, no eval set, and no clear baseline to beat. There is no answer to the question "is this working?" Replace it with three narrow, named workflows. Generate a quote. Summarize this customer's last 30 days. Draft a renewal email. Each is testable, each has a baseline, each ships in three weeks.

Failure 2. The PoC that was never going to make production

A team builds a Jupyter notebook, the model produces beautiful results on five hand-picked examples, the founder demos it to the board, and the entire thing dies the moment someone asks for latency, error handling, observability, or a price per call. The PoC was a sales artifact, not a system.

This is the single most common pattern in enterprise AI. I have rescued three of them in the last year. The fix is to write the production constraints before the notebook. What is the SLA. What is the cost ceiling per request. Who is the user. What happens when the model is wrong. Treat the PoC as a slice of the production system, not as a science fair project.
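To make that concrete, here is a minimal sketch of what constraints-first can look like in code rather than in a slide. Every field name and number below is a hypothetical example, not a standard; the point is that the thresholds exist in writing before the notebook does.

```python
# A hypothetical constraints artifact, written before any model code exists.
# The values are illustrative; what matters is that they are explicit, and
# that the PoC is rejected if it cannot meet them.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductionConstraints:
    p95_latency_seconds: float   # SLA: 95th-percentile response time
    max_cost_per_request: float  # in euros, all model calls included
    min_accuracy: float          # measured on the golden set, not on demos
    fallback: str                # what the user sees when the model is wrong

QUOTE_DRAFTING = ProductionConstraints(
    p95_latency_seconds=4.0,
    max_cost_per_request=0.05,
    min_accuracy=0.90,
    fallback="route to a human agent with the draft attached",
)
```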

Failure 3. The fine-tune that should have been a prompt

A founder reads on a podcast that fine-tuning is the path to a moat, hires an ML engineer, spends 40,000 euros generating training data, and ships a model that performs identically to a 200-token prompt against a frontier model. Now they own infra, they own model maintenance, and they cannot upgrade to the next generation without redoing the whole exercise.

Fine-tuning has a real role, but it is rarely where founders think it is. Use prompts and retrieval first. Measure the failure modes. Fine-tune only when prompts cannot solve a specific, measurable, repeatable shortfall. I cover the trade-offs in detail in my piece on RAG, fine-tuning, and agents.

Failure 4. The agent stack that loops

The agent calls tools, then calls itself, then calls more tools, then loops, then runs up a 12 dollar bill on a single user request. By the time the team adds rate limiting and budget caps, the user experience is degraded, trust is gone, and the agent is quietly downgraded to a single-turn assistant.

Agents are the most over-deployed AI pattern of the last 18 months. They look like the future, and sometimes they are. But for 80 percent of business workflows, a deterministic pipeline of two or three model calls beats an agent on cost, latency, and reliability. Reach for an agent only when the workflow has open-ended branching the user cannot specify in advance. If the user can describe the steps, do not give an agent the steering wheel.
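As an illustration of the contrast, here is a minimal sketch of such a deterministic pipeline. call_model is a placeholder for whatever model client you use, and the two steps are hypothetical examples, not a prescribed workflow.

```python
# A deterministic pipeline: fixed steps, fixed order, one model call each.
# No self-invocation and no open-ended tool loop, so cost and latency are
# bounded by construction. `call_model` stands in for your model provider.

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

def summarize_account(raw_activity: str) -> str:
    return call_model(f"Summarize this account's activity:\n{raw_activity}")

def draft_renewal_email(summary: str) -> str:
    return call_model(f"Draft a renewal email based on this summary:\n{summary}")

def renewal_workflow(raw_activity: str) -> str:
    # Exactly two model calls, every time. The worst case is known in
    # advance, which is never true of an agent loop.
    summary = summarize_account(raw_activity)
    return draft_renewal_email(summary)
```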

Failure 5. The model that nobody evaluates

This is the quiet killer. Six months after launch, the team gets a churn signal. Customers complain that the AI feature is unreliable. Nobody can answer how often the model is right because nobody built the eval harness. The team scrambles to build one, discovers the model has been wrong 30 percent of the time since launch, and has to face a tense customer call with no good answer.

I push every client to ship evals before they ship the feature. A simple golden set of 50 to 200 labeled examples, run on every prompt change, every model upgrade, every retrieval index rebuild. The cost is one engineer for a week. The cost of not having one is your customer relationship. There is no negotiation on this point.
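A harness like this can be very small. The sketch below assumes a golden set stored as JSON lines with input and expected fields, and a hypothetical run_feature function wrapping the feature under test; both names are illustrative.

```python
# Minimal golden-set harness: run it on every prompt change, model upgrade,
# and retrieval index rebuild. Fail the build if accuracy drops below target.
import json

def run_feature(input_text: str) -> str:
    raise NotImplementedError("wire this to the feature under test")

def evaluate(golden_path: str, min_accuracy: float = 0.90) -> float:
    with open(golden_path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        1 for ex in examples
        if run_feature(ex["input"]).strip() == ex["expected"].strip()
    )
    accuracy = correct / len(examples)
    print(f"{correct}/{len(examples)} correct ({accuracy:.0%})")
    assert accuracy >= min_accuracy, "accuracy below target, do not ship"
    return accuracy
```

Exact string match is the crudest possible scorer, and many features need a fuzzier check. But even this version turns "is the model right" from a shrug into a number you can track.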

The pattern under the patterns

Look at the five failures and the same root cause shows up in each one. The team optimized for the demo, not the system. AI has a peculiar property. It can produce a stunning demo on a Monday and lose the customer's trust on a Tuesday, because the demo path was the only path the team ever exercised. Production is not the demo path. Production is every angry edge case at 3 a.m. on a Sunday.

The startups that ship AI successfully invert the order. They start with the production constraints. They build the evals first. They ship one narrow workflow with a real customer. They measure for two weeks before adding scope. The demo comes at the end, and by then the demo is not the goal, it is a byproduct.

What this looks like in practice

When AlbTech Solutions takes on a new AI engagement, we run a two-week discovery that produces three artifacts. A workflow map of the work the AI is replacing or augmenting. A constraints document covering latency, cost, accuracy targets, and failure handling. An eval plan with the first 50 labeled examples and the metric that defines success. Only then do we touch a model.

The teams that follow this rhythm ship in 90 days. The teams that skip it ship in 18 months, or never. The cost of the discovery is two weeks. The cost of skipping it is the project.

If you are early in an AI build and you are seeing one of these five patterns in your own team, write to me. I respond within 48 hours and I am happy to give you 30 minutes on the structural fix.
