
How to write an AI product spec your engineers will actually build

Writing a product spec for an AI feature is different from writing a normal product spec. The same document that worked for a CRUD form will not work for a model-driven workflow. I rewrote the spec template I use after the third time I watched a clean-looking spec produce a confused engineering team. This is the version that holds up.

The seven sections of an AI product spec

  1. The problem and the user.
  2. The current workflow being replaced or augmented.
  3. The acceptance criteria.
  4. The model contract.
  5. The eval plan.
  6. The failure handling.
  7. The non-AI fallback.

If your spec has all seven and they are written with specifics, your engineering team can build it. If any section is hand-waved, the project will stall at exactly the spot you skipped.

Section 1. The problem and the user

One paragraph. Who has the problem, what the problem is, why now. Specific roles, specific moments. Not "customers": the Italian restaurant supply manager on a Monday morning placing the weekly order. Not "users": the compliance analyst at minute 47 of a quarterly audit walking through a regulator's checklist.

If you cannot name the role and the moment, the spec is not ready.

Section 2. The current workflow

Map the work the AI is replacing or augmenting, step by step, with timing. Manual labor only: the state of the world before the model touches it. Where does the work start? What does the human read? What do they write? Where do they hesitate? Where do they get it wrong today?

This section is the artifact engineers reference most often during build, and it is the section founders write last. Reverse the order. Write this in week 1 of discovery and have a domain user sign it. The model is a transformation of this workflow, and you cannot transform what you cannot describe.

Section 3. The acceptance criteria

This is the section that separates an AI spec from a deterministic spec. Acceptance for a normal feature reads like "the user clicks save and the record persists." Acceptance for an AI feature has to read like "the model produces a result that meets the following measurable criteria on the following labeled examples."

Three concrete things must appear here. The accuracy bar, expressed against a named labeled set. The latency target, expressed in seconds at the 95th percentile. The cost ceiling, expressed in cents or euros per request. Anything softer than this leaves the engineers guessing what good looks like.
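
The three bars can be written down as a single executable check. This is a sketch with illustrative thresholds, not the spec's real numbers; `meets_acceptance` and every constant in it are hypothetical stand-ins you would replace with the values your spec actually commits to.

```python
# Acceptance criteria as one executable check. All thresholds here are
# illustrative placeholders, not recommendations.

ACCURACY_BAR = 0.80       # fraction correct on the named labeled set
LATENCY_P95_S = 90.0      # seconds at the 95th percentile
COST_CEILING_EUR = 0.05   # euros per request

def meets_acceptance(accuracy: float, latency_p95_s: float, cost_eur: float) -> bool:
    """True only when all three bars from the spec are met simultaneously."""
    return (
        accuracy >= ACCURACY_BAR
        and latency_p95_s <= LATENCY_P95_S
        and cost_eur <= COST_CEILING_EUR
    )
```

The point of expressing it this way is that "good" stops being a matter of opinion: any candidate model either passes the function or it does not.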

Section 4. The model contract

This is the section that is missing from most specs and the one most responsible for late-stage rework. The model contract specifies the structured input the model receives and the structured output it must produce. Fields, types, enums, constraints, ranges.

For a proposal generator the contract might say the input is a customer brief, a product catalog excerpt, and three reference proposals, and the output is a JSON object with fields summary, line items, total, and rationale. The summary is bounded at 600 characters. The line items array contains objects with sku, quantity, unit price, and total. The rationale is plain text bounded at 1,500 characters.
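
The proposal contract above can be pinned down as types plus explicit bound checks. This is one possible rendering, assuming the field names from the paragraph; the `validate` helper is a hypothetical illustration of how an engineer might enforce the bounds.

```python
from typing import List, TypedDict

# The proposal-generator contract from the text, expressed as types.
# Field names follow the spec's example; the validator is illustrative.

class LineItem(TypedDict):
    sku: str
    quantity: int
    unit_price: float
    total: float

class ProposalOutput(TypedDict):
    summary: str           # bounded at 600 characters
    line_items: List[LineItem]
    total: float
    rationale: str         # plain text, bounded at 1,500 characters

def validate(output: ProposalOutput) -> List[str]:
    """Return a list of contract violations; an empty list means conformance."""
    errors: List[str] = []
    if len(output["summary"]) > 600:
        errors.append("summary exceeds 600 characters")
    if len(output["rationale"]) > 1500:
        errors.append("rationale exceeds 1,500 characters")
    for item in output["line_items"]:
        if abs(item["quantity"] * item["unit_price"] - item["total"]) > 0.01:
            errors.append(f"line item {item['sku']}: total does not match quantity x unit price")
    return errors
```

Writing the contract as code rather than prose means a violation is caught by a test, not by a reviewer.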

Once the model contract is fixed, the front-end team can build against it before the model is ready. Mock the contract, ship the UI, swap in the model. This is the trick that compresses AI build timelines.
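
The mock-then-swap trick can be as small as one function that returns a contract-shaped object. A minimal sketch, assuming the proposal contract described above; `mock_proposal` is a hypothetical name and the content is deliberately fake.

```python
import random

# A minimal mock of the proposal contract so the front end can build
# before the model exists. Shape and bounds follow the contract in the
# text; every value is placeholder data.

def mock_proposal(seed: int = 0) -> dict:
    rng = random.Random(seed)  # seeded, so the UI sees stable fixtures
    items = [
        {
            "sku": f"SKU-{i:03d}",
            "quantity": rng.randint(1, 10),
            "unit_price": round(rng.uniform(5.0, 50.0), 2),
        }
        for i in range(3)
    ]
    for item in items:
        item["total"] = round(item["quantity"] * item["unit_price"], 2)
    return {
        "summary": "Mock proposal for UI development.",       # <= 600 chars
        "line_items": items,
        "total": round(sum(i["total"] for i in items), 2),
        "rationale": "Placeholder; replaced by model output.",  # <= 1,500 chars
    }
```

Because the mock honors the same contract, swapping in the real model later is a one-line change behind the same interface.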

Section 5. The eval plan

Three things appear here. The labeled set, with size and source. The metric, with a single sentence describing how it is computed. The baseline, with the score the current human or rule-based process produces on the same set. Without a baseline, you cannot tell whether the AI is helping.

The eval plan must list the first 50 to 200 examples by name or ID, owned by a named domain expert who labels them in the first two weeks of the build. If the eval plan is written without a labeling owner, it will not exist when you need it.
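
The baseline comparison the eval plan promises can be sketched in a few lines. This assumes per-example scores are already computed (the scoring itself would call the model and compare against labels); `evaluate` is a hypothetical helper, and the numbers in the usage below are invented.

```python
# Summarize model scores against the baseline on the same labeled set.
# Scores are per-example metric values already computed elsewhere.

def evaluate(scores: list, baseline: float) -> dict:
    """Mean score, baseline, and lift; 'helping' is the question the plan answers."""
    mean = sum(scores) / len(scores)
    return {
        "mean_score": round(mean, 3),
        "baseline": baseline,
        "lift": round(mean - baseline, 3),
        "helping": mean > baseline,
    }
```

With a baseline of 3.2 and example ratings of 4, 5, 3, and 4, the mean is 4.0 and the lift is 0.8. Without the `baseline` argument, the function, like the eval plan, cannot tell you whether the model is helping.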

Section 6. The failure handling

What happens when the model gets it wrong. Founders skip this section because they assume the model will mostly be right. The model will mostly be right. The product lives or dies in the cases where it is not.

Three failure handling strategies need to appear in the spec. What the user sees when the model returns low confidence or refuses. What the system does when the model exceeds the latency or cost ceiling. What the audit trail captures so a human can review and correct the answer after the fact. A spec without all three of these is a spec that ships a fragile feature.
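
The three strategies reduce to one dispatch point the spec can name explicitly. A sketch under assumed thresholds; `handle`, `Outcome`, and all defaults are illustrative, and the audit-trail write is represented only by a comment.

```python
from enum import Enum

# The three failure-handling strategies from the spec as a single
# dispatch point. Thresholds are illustrative defaults.

class Outcome(Enum):
    OK = "ok"                      # normal result shown to the user
    NEEDS_REVIEW = "needs_review"  # low confidence: flag for a human
    FELL_BACK = "fell_back"        # ceiling exceeded: serve the non-AI path

def handle(confidence: float, latency_s: float, cost_eur: float,
           latency_cap_s: float = 90.0, cost_cap_eur: float = 1.20,
           min_confidence: float = 0.6) -> Outcome:
    """Decide what the user sees; every branch should also be written to the audit trail."""
    if latency_s > latency_cap_s or cost_eur > cost_cap_eur:
        return Outcome.FELL_BACK
    if confidence < min_confidence:
        return Outcome.NEEDS_REVIEW
    return Outcome.OK
```

A spec that names these three outcomes lets design, engineering, and support argue about the right behavior before launch instead of after.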

Section 7. The non-AI fallback

This is the section senior engineers love and inexperienced product managers skip. Every AI feature should have a non-AI path that produces a usable, if degraded, result. Templates instead of generation. Filters instead of semantic search. A human escalation instead of an agent loop.

The fallback exists for three reasons. The model may be unavailable. The cost ceiling may be hit. The customer may operate in a context where the model is not allowed. If the spec does not name the fallback, the engineering team will improvise one under pressure, badly.

Worked example. A draft generator for compliance memos

Section 1. The user is a compliance analyst at a financial institution preparing a quarterly memo for the regulator. The moment is the day before the deadline, when they are stitching together findings from 30 to 60 documents and a structured questionnaire response.

Section 2. Today the analyst opens 30 to 60 documents in tabs, copies relevant excerpts into a draft, restructures, edits, and reviews. The work takes six to nine hours per memo and accounts for 40 to 70 percent of the analyst's weekly output.

Section 3. Acceptance is that the model draft is rated 4 or 5 by a senior analyst on a 5-point usability scale on at least 80 percent of a labeled set of 100 historical memos. Latency target is 90 seconds at the 95th percentile. Cost ceiling is 1.20 euros per draft.

Section 4. Input is a structured memo brief and a list of source document IDs. Output is a JSON object with fields title, executive_summary, sections (array of section_title, body), citations, and confidence. Each section body is bounded at 4,000 characters. Citations reference source document IDs.
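
An illustrative instance of that output contract makes the shape concrete. The content below is invented; only the field names and bounds follow the spec, and the document IDs are hypothetical.

```python
# An invented instance of the memo output contract from Section 4.
# Shape and bounds follow the spec; all values are placeholders.

draft = {
    "title": "Q3 Compliance Memo",
    "executive_summary": "Findings drawn from 42 source documents.",
    "sections": [
        {"section_title": "Scope", "body": "Review covered desks A and B."},
        {"section_title": "Findings", "body": "Three exceptions were identified."},
    ],
    "citations": ["DOC-0012", "DOC-0031"],  # source document IDs
    "confidence": 0.82,
}

# The bound from the spec holds for every section body.
assert all(len(s["body"]) <= 4000 for s in draft["sections"])
```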

Section 5. Eval set is 100 historical memos labeled by two senior analysts. Metric is the average usability rating on a 5-point scale. Baseline is the current analyst's first-draft rating, which historically averages 3.2.

Section 6. On low confidence the system flags the section for human review. On a latency or cost overrun the system returns a shorter structured draft. The audit trail captures the prompt, the retrieved documents, the response, and the analyst's edits.

Section 7. The non-AI fallback is a structured template the analyst fills in manually with retrieval-only suggestions of relevant documents.

The line that decides whether the project ships

Section 3, the acceptance criteria. If the team agrees on a measurable, labeled, latency-bound, cost-bound acceptance bar, the project ships. If the bar is soft, the project drifts. Spend the time to land this section, and the rest follows.

If you want a second opinion on a spec before you hand it to engineering, write to me. I respond within 48 hours.
