Working Patterns for AI Development Teams (Part 1: The Framework)

This is Part 1 of a 3-part series on AI development patterns.

Part 1 (this article): The Framework - Core patterns and concepts

Part 2: Implementation Guide - Infrastructure, security, and adoption

Part 3: Production Operations - Observability, ROI, and measuring success

For Engineering Leaders (2-minute read)

These three patterns address AI development’s biggest risks: stale AI suggestions waste hours fixing deprecated code, large AI-generated PRs create review bottlenecks, and passing tests hide semantic failures that escape to production. Implementation costs $1,900-2,100/month for a 10-person team with 5-20 month payback depending on adoption success. Teams that don’t structure AI development accumulate technical debt faster than they ship features.

Read this if: You’re experiencing slower delivery despite using AI assistants, production incidents from AI-generated code, or review bottlenecks from large PRs.

Time to read: 12 minutes | Implementation: 2-4 sprints for pilot team

AI-assisted development breaks traditional workflows in three specific ways that directly impact delivery speed and code quality. Your AI suggests deprecated_function() because its training data is 18 months old—leading to wasted hours fixing deprecated patterns. Your 1,500-line refactoring PR sits for days before someone rubber-stamps it with “LGTM”, creating review bottlenecks that slow feature delivery. Your unit tests pass, but the AI hallucinates in production because tests assume deterministic behavior—catching functional correctness but missing semantic failures.

Static analysis catches syntax errors. Unit tests verify functional correctness. But AI systems operate probabilistically—the same input produces variable outputs based on model weights, prompts, and context. You need working patterns designed for non-deterministic systems.

This article presents a three-layer framework: Spec-Driven Development for planning (reduces rework), Evaluation-Driven Development for quality measurement (catches issues before production), and Structured Code Review for knowledge transfer (reduces review time while improving team learning).

The Three-Layer Framework

Layer 1: Spec-Driven Development (Stop Context Drift)

Traditional AI-assisted coding looks like this: you chat with your AI, explain requirements, clarify constraints, then re-explain architectural decisions three turns later when it forgets. Context drifts, you repeat yourself, and the AI contradicts earlier decisions.

Spec-Driven Development eliminates this by writing persistent Markdown specifications that serve as single sources of truth. These specifications remain stable across AI sessions, reducing time spent re-explaining requirements and enabling consistent implementation regardless of which team member picks up the work.

Core Principles

Separate What/Why from How

Define desired behavior and business intent, letting AI handle implementation details. Your spec describes external behavior—input/output mappings, preconditions, postconditions, invariants—without prescribing specific code structure.

Use Domain-Oriented Language

Avoid implementation-specific terminology (class names, method signatures, framework patterns). Instead, use Given/When/Then scenarios from behavior-driven development to describe user-facing behavior in domain language.

Embed Technical Constraints

Include exact naming conventions, database schemas, and API contracts directly in specs. These aren’t implementation details you’re prescribing—they’re integration requirements imposed by existing systems. The spec documents constraints that any implementation must satisfy, not the specific approach to satisfy them.

Example Specification

❌ Common Mistakes to Avoid:

## Google Login

Add Google OAuth to the app. Create a GoogleAuthController class that extends BaseController. Use the passport-google-oauth20 middleware and configure it in server.js. The user should click a button and then get redirected after logging in. Store the tokens in the database.

Why this fails:

Too implementation-focused (“create a GoogleAuthController class” - prescribes structure)
Vague scenarios (“user should click a button” - no clear success/failure cases)
Missing acceptance criteria (how do we know it works?)
No security guidance (where/how to store tokens?)
Ambiguous intent (“add Google OAuth” - why? what problem does this solve?)

✅ Well-Structured Specification:

## Feature: OAuth2 Authentication

### Business Intent

Enable Google OAuth login to reduce signup friction.
Current completion rate: 15%. Target: measurable increase.

### Given/When/Then Scenarios

**Scenario 1: Successful Login**

- Given: User clicks "Sign in with Google"
- When: OAuth flow completes successfully
- Then: User redirected to dashboard with session token

**Scenario 2: Failed Authentication**

- Given: User denies OAuth permissions
- When: Redirect returns with error code
- Then: Display error message, log failure reason

### Technical Constraints

- Library: `[email protected]` (current stable release)
- Session storage: Redis with 7-day TTL
- Database: `users` table with columns: `oauth_provider`, `oauth_id`, `email`, `refresh_token_encrypted`
- Encryption: Encrypt refresh tokens using AES-256-GCM before storage; use application-level encryption with keys from secret manager (AWS Secrets Manager, HashiCorp Vault)
- Security: No plaintext secrets in logs (enforce via linter); never commit credentials or encryption keys to version control

### Acceptance Criteria

- [ ] OAuth flow completes within reasonable time (measure p95 latency)
- [ ] Token refresh handles network failures gracefully
- [ ] Automated security scan passes (no leaked credentials)
- [ ] Unit test coverage: 80%+ on auth module

How to use this

Paste the complete specification into your AI assistant with the prompt: “Implement the following specification: [spec content]”. The AI will generate code that satisfies the constraints and scenarios. If it asks clarifying questions about behavior, update the spec with that information before regenerating—the spec should remain your single source of truth.

The spec remains stable across iterations, serving as the contract between product owners, developers, and AI tools. When requirements change during sprint refinement or stakeholder feedback, update the specification first as your team’s source of truth, then regenerate code. This keeps all team members aligned on the “what” while developers and AI focus on the “how.”

Why it works:

GitHub engineering documented using this pattern to build internal tools entirely from Markdown specifications, with AI assistants generating implementation code from structured requirements documents. The approach eliminated back-and-forth clarification cycles that previously consumed significant development time.

Layer 2: Evaluation-Driven Development (Measure “How Well”)

Traditional Test-Driven Development asks “Does it work?” with binary pass/fail assertions—suitable for deterministic code but insufficient for AI systems that produce variable outputs.

Evaluation-Driven Development asks “How well does it work?” with continuous measurement across multiple dimensions (quality, cost, performance, reliability), enabling teams to catch degraded AI outputs, cost spikes, and performance regressions before they impact users or budgets.

The Fundamental Shift

AI systems don’t “work” in the traditional sense—they perform within acceptable statistical bounds under specific conditions at particular cost points. This means you can’t guarantee a single “correct” answer, but you can ensure that most outputs meet quality standards, at a predictable cost per request, with acceptable latency.

The same email extraction prompt might return "[email protected]" or "[email protected]" on consecutive runs. Both are correct outputs, but a traditional assertion like assert result == '[email protected]' fails 50% of the time due to case variation. You need assertions that validate semantic correctness—“is this a valid email?”—rather than exact string matching.

Five Failure Modes Traditional Testing Misses

Binary Illusion - Expected exact match fails due to valid output variation (e.g., '[email protected]' vs '[email protected]' are both correct, but traditional tests fail)
Model Volatility - Model updates break previously passing tests without code changes (external AI provider updates can silently degrade your application)
Cost Blindness - Tests pass while infrastructure costs spike 10x (a passing test doesn’t reveal you’ve switched from a cheap model to an expensive one)
Model Drift - Prompt effectiveness degrades over time without observable code changes (gradual quality decline hard to detect without continuous monitoring)
Hidden Triggers - Context drift, database updates, feedback loops cause failures independently (AI systems fail in ways traditional software doesn’t, requiring different monitoring approaches)

Multi-Dimensional Metrics

Dimension	What to Measure	Example Threshold	Why It Matters
Quality	Accuracy, precision, recall	>90%	Semantic correctness
Performance	Latency (p95), throughput	<200ms	User experience
Cost	$ per 1K requests	<$0.05	Operational viability
Reliability	Success rate over time	>99%	Production stability

How Major AI Labs Do This

OpenAI open-sourced their evaluation framework, stating “building effective evals is core to LLM application development.”

Microsoft Azure AI implements continuous evaluation across model selection, pre-production, and production monitoring stages.

Google DeepMind uses three independent LLM judges to verify factuality, reconciling disagreements through majority voting. This reduces false positives from individual judge hallucinations—a single judge might incorrectly approve an answer, but the probability of two judges making the same error is significantly lower.

Practical Implementation

Start with one component to build team capability before scaling:

Scope narrowly - Focus on single capability (e.g., “classify support tickets by urgency”) to prove value before expanding
Create baseline dataset - Build 50-100 labeled examples including edge cases. Include boundary conditions, adversarial inputs, and common user errors
Define metrics - Choose quality, performance, and cost measures based on business requirements
Set thresholds - Establish minimum acceptable performance aligned with user expectations and operational budgets
Automate evaluation - Integrate into CI pipeline to catch issues before deployment
Monitor drift - Alert when metrics degrade so teams can respond before user impact escalates

Example CI Integration

Prerequisites: Create .promptfoo/config.yaml in your repository defining test cases and evaluation criteria. Store API keys as repository secrets.

Step 1: Create evaluation configuration

# .promptfoo/config.yaml
description: "Email extraction evaluation"

prompts:
  - "Extract the email address from the following text: {{input}}"

providers:
  - id: openai:gpt-4
    config:
      temperature: 0  # Deterministic for evaluation
  - id: openai:gpt-3.5-turbo  # Compare models
    config:
      temperature: 0

tests:
  - vars:
      input: "Contact me at [email protected] for more info"
    assert:
      - type: contains
        value: "[email protected]"
      - type: javascript
        value: "output.match(/^[\\w.-]+@[\\w.-]+\\.\\w+$/)"  # Valid email format

  - vars:
      input: "Reach out: [email protected]"
    assert:
      - type: icontains  # Case-insensitive
        value: "[email protected]"

  - vars:
      input: "Email us at [email protected] or [email protected]"
    assert:
      - type: contains-any
        value: ["[email protected]", "[email protected]"]

  - vars:
      input: "No email address in this text"
    assert:
      - type: is-json
      - type: javascript
        value: "output === null || output === ''"  # Should return null/empty

outputPath: ./promptfoo-results.json

defaultTest:
  options:
    provider: openai:gpt-4

Step 2: Set up GitHub Actions workflow

# .github/workflows/ai-evaluation.yml
name: AI System Evaluation

on: [pull_request]

jobs:
  eval-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install dependencies
        run: npm ci

      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npx promptfoo eval --config .promptfoo/config.yaml --output json > results.json

      - name: Check quality threshold
        run: |
          PASS_RATE=$(jq -r '.stats.successes / .stats.total' results.json)
          THRESHOLD=0.90
          if (( $(echo "$PASS_RATE < $THRESHOLD" | awk '{print ($1 < $2)}') )); then
            echo "❌ Quality threshold not met: ${PASS_RATE} < ${THRESHOLD}"
            exit 1
          fi
          echo "✅ Quality threshold met: ${PASS_RATE} >= ${THRESHOLD}"

Cost Consideration

Running evaluations on every PR accumulates API costs. For a team of 10 developers with ~50 PRs/week:

Conservative: ~$100/month (using cheaper models like GPT-3.5-turbo or Claude Haiku)
Expected: ~$200/month (mix of GPT-4 and cheaper models)

See Part 2: Implementation Guide for complete cost breakdown.

This shifts evaluation left—catching semantic failures before merge rather than discovering them in production.

Layer 3: Structured Code Review (Transfer Knowledge)

Code review serves two purposes: catch defects and transfer knowledge. But traditional reviews fail at both, especially with AI-generated code. AI generates features fast—developers create 1,500-line refactoring PRs that sit for days awaiting review. When PRs finally get reviewed, generic “LGTM” approvals accomplish neither goal, leading to defects that escape to production and knowledge silos that slow onboarding.

Solving this requires two strategies working together:

Strategy 3a: Manage Review Volume - AI generates code faster than humans can review it. Break large changes into stacked PRs (50-120 lines each) so reviewers can provide thoughtful feedback on digestible chunks instead of rubber-stamping overwhelming changesets. [See Part 3: Production Operations for stacked PR implementation details, tooling options like Graphite/Ghstack, and when to use this approach.]

Strategy 3b: Structure Reviews for Knowledge Transfer - Even with manageable PR sizes, generic “LGTM” comments waste the review opportunity. The Triple R Pattern transforms every review into a teaching moment that compounds team knowledge over time.

The Triple R Pattern

Structure every review comment with three components:

Request - Specific change needed (what to do)
Rationale - Why it matters, with reference to team standards, security best practices, or performance benchmarks
Result - Desired outcome described concretely (what good looks like when done)

Example

❌ Unhelpful:

“This function is too complex”

Why this fails:

No specific action requested
No explanation of the problem
No guidance on what “good” looks like
Junior devs don’t learn anything

✅ Structured (Triple R):

Request: Refactor processUserData() into smaller units
Rationale: Functions over 50 lines are hard to test and maintain (team standard: max 50 lines per function)
Result: Break into validateInput(), transformData(), persistToDatabase() - each with single responsibility and independent test coverage

This transforms reviews from gatekeeping into mentorship. Junior developers receive specific guidance. Senior developers document architectural reasoning for future reference.

Focus Human Attention

Let AI pre-filtering catch trivial issues. Focus reviewers on what matters:

Architecture decisions - Does this fit our system design?
Business logic correctness - Does this solve the user problem?
Security implications - What are the attack vectors?
Maintainability - Will this be understandable in 6 months?

What’s Next

You now understand the three-layer framework for AI development:

Spec-Driven Development - Persistent specifications eliminate context drift
Evaluation-Driven Development - Multi-dimensional metrics catch AI-specific failures
Structured Code Review - Transform gatekeeping into knowledge transfer

Ready to implement?

Part 2: Implementation Guide covers infrastructure requirements, security considerations, and phased adoption strategies
Part 3: Production Operations covers production observability, incident response, and measuring ROI

Glossary

: The latency value below which 95% of requests fall. If p95 = 200ms, that means 95% of requests complete in under 200ms, with the slowest 5% taking longer.
: Systems where the same input can produce different outputs each time. AI models are probabilistic—they generate responses based on statistical patterns, not fixed rules.
: When AI outputs are grammatically correct and syntactically valid but meaningless or incorrect in context. Example: extracting "[email protected]" from "Contact us at [email protected]".
: A standard that allows AI tools to fetch current documentation on-demand instead of relying on outdated training data. Solves the problem of AI suggesting deprecated code patterns.
: A curated set of labeled examples used to measure AI system quality. Contains inputs with expected outputs for testing if the AI performs correctly across various scenarios.
: A code review structure requiring three components: Request (what to change), Rationale (why it matters), and Result (what good looks like when done).
: A behavior-driven development format for describing features: Given (preconditions), When (action), Then (expected outcome). Makes requirements testable and unambiguous.

References

GitHub Engineering , "Spec-Driven Development: Using Markdown as a Programming Language" , September 30, 2025. https://github.blog/ai-and-ml/generative-ai/spec-driven-development-using-markdown-as-a-programming-language-when-building-with-ai/
Thoughtworks , "Spec-Driven Development: Unpacking One of 2025's Key New AI-Assisted Engineering Practices" , December 4, 2025. https://www.thoughtworks.com/en-us/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices
AltexSoft , "Acceptance Criteria for User Stories in Agile: Purposes, Formats, and Best Practices" , December 1, 2023. https://www.altexsoft.com/blog/acceptance-criteria-purposes-formats-and-best-practices/
Nimrod Busany , "From TDD to EDD: Why Evaluation-Driven Development Is the Future of AI Engineering" , July 25, 2025. https://medium.com/@nimrodbusany_9074/from-tdd-to-edd-why-evaluation-driven-development-is-the-future-of-ai-engineering-a5e5796b2af4
Netguru , "Testing AI Agents: Why Unit Tests Aren't Enough" , July 9, 2025. https://www.netguru.com/blog/testing-ai-agents
Umamaheswaran , "Improving Code Reviews: From LGTM to Excellence" , November 25, 2024. https://umamaheswaran.com/2024/11/25/improving-code-reviews-from-lgtm-to-excellence/