Spec Driven Development treats intent as a versioned artifact. Instead of prompting AI with vague requests, you write a spec that defines behaviour, constraints and edge cases. The AI implements the spec rather than inventing requirements. When output looks wrong, you fix the code or fix the spec. Three adoption levels exist: spec-first (write once, mostly forget), spec-anchored (evolve spec with code) and spec-as-source (spec is the only thing humans edit).
Read this if: You want the speed of AI-assisted development without letting the model decide how your system behaves.
Even in strong teams, only part of the real intent ever makes it into durable artifacts. You get architecture diagrams and, if you’re lucky, the odd ADR. Tickets exist, but the real detail lives in tribal knowledge and in the code itself. In a human-only workflow this is not ideal, but it can be acceptable. The people who design and review the change usually share enough context to fill the gaps.
AI-assisted development breaks that safety net. Once you let an LLM generate non-trivial chunks of your system, the “author” of new behaviour is no longer a person who remembers why it looks that way. Code becomes cheap to produce. A durable understanding of what that code is supposed to do does not.
When intent is not written down, the model fills the gaps. It infers missing requirements, then bakes in choices you never discussed, like retry semantics or whether errors should be swallowed and logged. The changes often look fine on review, yet the real definition of correct behaviour has quietly moved from your team’s head into whatever the model happened to infer that day.
Spec Driven Development is a response to that shift. It treats intent as a first-class artifact, a memory bank for behaviour, and uses it to anchor AI output to something the team can read, test and version over time.
AI-assisted development and the missing system of record
Most teams started with “AI as autocomplete”, then quietly escalated to “AI as pair programmer”. The step change is not just speed. It is the economics of iteration: generating another approach costs almost nothing, so you do it all the time.
That creates two predictable failure modes.
First, context loss: prompt history is not durable. It does not survive team handoffs or three months of other work landing on main.
Second, implicit decisions: when you do not spell out constraints, the model picks defaults. That shifts the definition of correct behaviour from your requirements to whatever the model inferred from the prompt and the codebase, and these inferences rarely get reviewed as decisions.
There is also a more mechanical limit: context windows. An agent only sees a fixed amount of input on each call. As your system grows, you either drop important background information or keep re-explaining it in every session.
All three point to the same root issue: prompts are a terrible system of record. A prompt optimises for getting work done right now. It does not give you a stable, reusable bundle of intent that you can apply across sessions, people or agents.
What a spec is and how SDD uses it
“Spec” can mean just about anything: a PRD, a user story, an OpenAPI YAML or a sketch in Miro. For this discussion, let’s treat a spec as a structured document in natural language that describes how a slice of the system should behave. It’s a small, reviewable repo artifact that defines behaviour and constraints precisely enough that two engineers would build the same thing: it leaves as little room for ambiguity as reasonably possible.
By this definition, "add a notifications feature" is not a spec (and, to be fair, it never should be considered one). It is a vague request. A spec for notifications would say who gets notified, on which events, in what order, with what rate limits, how retries work and which failures reach the user.
There is no universal spec format. Tools like Tessl, Kiro and BMAD each take their own approach. What matters is that the spec captures enough detail to remove ambiguity from implementation.
Here is an example:
## Feature: Order Completion Email
### Business Intent
Notify customers when their order is complete to reduce "where is my order" support queries.
### Given/When/Then Scenarios
**Scenario 1: Successful Notification**
- Given: Order status changes to `completed`
- When: Customer has valid email on record
- Then: Send email using `order-complete-v1` template with order summary and estimated delivery
**Scenario 2: Missing Email**
- Given: Order status changes to `completed`
- When: Customer has no email on record
- Then: Skip notification, log warning with order_id for ops review
**Scenario 3: Transient Failure**
- Given: Email send fails with 5xx error
- When: Retry count < 3
- Then: Retry with exponential backoff (1s, 2s, 4s)
**Scenario 4: Permanent Failure**
- Given: Email send has failed 3 times
- When: Next retry would occur
- Then: Mark as failed, alert ops, stop retrying
### Technical Constraints
- Template: `order-complete-v1` with order_id, item summary, estimated delivery
- Subject line: "Your order #{{order_id}} is complete"
- Out of scope: SMS fallback, batching multiple orders into one email
### Acceptance Criteria
- [ ] Email queued synchronously on status change (no background delay)
- [ ] Missing email cases logged with order_id
- [ ] Retry logic handles transient failures without sending duplicates
- [ ] All four scenarios covered by automated tests
The format is less important than the discipline: write down the decisions that would otherwise live in someone’s head or get silently invented by the model.
Spec-driven development takes this kind of spec and puts it at the centre of AI-assisted work. You first write or refine the spec. The spec compresses all the relevant context into something that fits inside the model’s window and can be reused across prompts. Then you ask the model to implement that spec. When the code looks wrong, you argue with the spec, not the diff. Either the implementation is off or the spec was incomplete and must change.
The practical shift is simple: the model stops inventing requirements on the fly. It becomes a very fast implementer of behaviour you have already agreed on.
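To make that concrete, here is a minimal sketch of what an implementation of scenarios 1, 3 and 4 from the spec above could look like. The `EmailClient` interface, the `alertOps` callback and the field names are assumptions made up for this sketch, not part of the spec, and the loop reads the retry rule as one initial attempt plus one retry per backoff step.

```typescript
// Sketch only: names and shapes below are illustrative assumptions, not the spec's API.
interface EmailClient {
  send(template: string, to: string, data: Record<string, unknown>): Promise<void>;
}

// Per the spec: retry transient failures with exponential backoff (1s, 2s, 4s),
// then mark as failed, alert ops and stop retrying.
const BACKOFF_MS = [1_000, 2_000, 4_000];

async function sendOrderCompleteEmail(
  client: EmailClient,
  to: string,
  order: { id: string; itemSummary: string; estimatedDelivery: string },
  alertOps: (message: string) => void,
): Promise<"sent" | "failed"> {
  // One initial attempt, plus one retry per backoff step.
  for (let attempt = 0; attempt <= BACKOFF_MS.length; attempt++) {
    try {
      await client.send("order-complete-v1", to, {
        order_id: order.id,
        item_summary: order.itemSummary,
        estimated_delivery: order.estimatedDelivery,
      });
      return "sent"; // Scenario 1: notification sent successfully
    } catch {
      // Scenario 3: treat the failure as transient and back off before the next attempt.
      if (attempt < BACKOFF_MS.length) {
        await new Promise((resolve) => setTimeout(resolve, BACKOFF_MS[attempt]));
      }
    }
  }
  // Scenario 4: retries exhausted, mark as failed, alert ops, stop retrying.
  alertOps(`order-complete email failed permanently for order ${order.id}`);
  return "failed";
}
```

Every branch maps back to a named scenario, which is what makes the diff reviewable against the spec rather than against a reviewer's memory.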
SDD implementation layers
In the wider industry there is a spectrum of views on what to do with specs once you have them. At one end you treat specs as the primary source of truth and see code as a generated byproduct. Update the spec, regenerate the implementation, repeat. At the other end you treat specs more like tests or design notes that drive code generation. In the latter model the executable code remains the artifact you ultimately maintain, while specs exist to shape it and keep intent visible.
This difference shows up as three common “implementation layers”.
Spec first
You write a solid spec, use it to drive AI work for the task, ship the change, then mostly forget the spec. It might stay in the repo but nobody treats it as essential once the feature is live. This is the easiest entry point and still better than pure vibe coding, because at least the initial intent is explicit and reviewable. (For more on why unstructured prompting falls apart in production, see Vibe Coding is not a production strategy.)
Spec anchored
You keep the spec around and evolve it alongside the implementation. Any behavioural change updates the spec and the code together. The spec becomes part of the maintenance surface rather than a one-off artifact.
This is where SDD starts paying long-term dividends, because you can diff intent over time and “silent drift” becomes harder.
Spec as source
The spec becomes the main source file. Humans edit the spec, code is generated from it, and humans do not touch the generated code.
This is the most extreme form. Some tools explicitly aspire to it. Tessl, for example, positions specs as the primary artifact and explores generated code that should not be edited by hand, going as far as adding a `// GENERATED FROM SPEC - DO NOT EDIT` comment.
A typical SDD workflow
Different tools implement SDD differently, but a few principles hold regardless of tooling.
- Intent comes first. You define behaviour before asking for code. The format varies but the sequence does not: decide what the system should do, write it down, then implement.
- The spec gets refined before implementation starts. Whether you prompt an LLM to critique your draft or use a structured agent workflow, this is where you catch missing edge cases, ambiguous language and implicit assumptions.
- Implementation is measured against the spec. When output does not match what the spec says, you fix the code or fix the spec. The spec is the reference point, not the diff. This changes how you review: instead of asking "does this code look right?" you ask "does this code do what the spec says?" (the test sketch below shows what that can look like).
- Spec and code travel together. Changes to behaviour update both artifacts. How tightly they stay coupled depends on your implementation layer: spec-first teams might let specs drift after merge, spec-anchored teams update them with every behavioural change, spec-as-source teams treat the spec as the only thing humans edit.
The model’s role in all of this is to implement whatever the spec says. The spec defines the loop.
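Reviewing against the spec gets easier when the scenarios exist as tests. Here is a sketch of scenario 2 (“Missing Email”) encoded with Node’s built-in test runner; the handler and its dependency names are assumptions invented for this sketch, and the assertions come straight from the scenario’s Then clause.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Stand-in handler so the sketch is self-contained; in a real repo this would be
// the implementation generated from (and reviewed against) the spec.
async function handleOrderCompleted(
  order: { orderId: string; customerEmail: string | null },
  deps: {
    sendEmail: (template: string, to: string) => Promise<void>;
    logWarning: (message: string) => void;
  },
): Promise<void> {
  if (!order.customerEmail) {
    // Scenario 2: skip notification, log warning with order_id for ops review.
    deps.logWarning(`order-complete email skipped, no email on record (order_id=${order.orderId})`);
    return;
  }
  await deps.sendEmail("order-complete-v1", order.customerEmail);
}

test("Scenario 2: missing email skips the notification and logs a warning", async () => {
  const sent: string[] = [];
  const warnings: string[] = [];

  await handleOrderCompleted(
    { orderId: "ord_123", customerEmail: null },
    {
      sendEmail: async (template) => { sent.push(template); },
      logWarning: (message) => { warnings.push(message); },
    },
  );

  // Then: skip notification, log warning with order_id.
  assert.equal(sent.length, 0);
  assert.ok(warnings.some((warning) => warning.includes("ord_123")));
});
```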
Where SDD breaks
Spec-driven development does not magically make AI predictable: even with all the guardrails in place, code generation remains nondeterministic. You can give the model the same spec twice and get completely different implementations. (If you are just getting started with agents, see 5 mistakes when starting with Agents for more on working with nondeterminism).
This means tests, CI and review remain paramount: they are what keep quality and architecture in one piece when the same spec can yield different code.
SDD also costs time up front. You write specs, review them and keep them in sync with behaviour. If that work does not have real protection in the process, specs start to rot.
The fix is not clever. Treat spec updates as part of the PR definition of done and enforce that as strictly as formatting or tests. If behaviour changes, the spec changes in the same PR. No exceptions, no “I will update the doc later”. That is the only way the spec stays useful enough that people keep using it.
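One way to take the “no exceptions” part out of human hands is a small pre-merge check. The sketch below assumes a hypothetical layout where the notification behaviour lives under `src/notifications/` and its spec at `specs/order-completion-email.md`; adapt the paths and base branch to your repo.

```typescript
// Hypothetical pre-merge check: if behaviour code changed, the spec must change in the same PR.
import { execSync } from "node:child_process";

const changedFiles = execSync("git diff --name-only origin/main...HEAD")
  .toString()
  .trim()
  .split("\n")
  .filter(Boolean);

const codeChanged = changedFiles.some((file) => file.startsWith("src/notifications/"));
const specChanged = changedFiles.includes("specs/order-completion-email.md");

if (codeChanged && !specChanged) {
  console.error(
    "Behaviour under src/notifications/ changed but specs/order-completion-email.md did not. " +
      "Update the spec in the same PR.",
  );
  process.exit(1);
}
```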
Related Reading
If you want to go deeper, here’s where to start:
- 5 mistakes when starting with Agents - Common pitfalls when adopting SDD and how to avoid them
- 8 Patterns for Spec-Driven Development - Practical architectural patterns once you have the basics down
- Working Patterns for AI Development Teams (Part 1) - The broader framework: spec-driven context, evaluation-driven quality, and structured code review
- Overcoming the Human Bottleneck in AI-driven Software Development - Why human validation remains the crucial bottleneck and what to do about it
Glossary
- ADR (Architecture Decision Record): A document that captures an important architectural decision, including the context, the decision itself and its consequences. ADRs create a log of why the system looks the way it does.
- PRD (Product Requirements Document): A document that describes what a product or feature should do from a user perspective. Typically written by product managers, it outlines goals, user stories and success criteria.
- Context window: The maximum amount of text an AI model can process in a single request. Once you exceed this limit, older information gets dropped. This is why long conversations or large codebases require careful context management.
- Non-deterministic systems: Systems where the same input can produce different outputs each time. AI models are probabilistic: they generate responses based on statistical patterns, not fixed rules.