This is Part 2 of a 3-part series on AI development patterns.
- Part 1: The Framework - Core patterns and concepts
- Part 2 (this article): Implementation Guide - Infrastructure, security, and adoption
- Part 3: Production Operations - Observability, ROI, and measuring success
Implementing AI development patterns requires $155-330/month infrastructure spend plus 6-12 months organizational change management—not the 2-3 months most teams expect. The critical security risks are MCP supply chain attacks (compromised documentation sources), secrets in AI context (API keys leaked to model providers), and unvalidated AI code (subtle vulnerabilities that pass static analysis). Start with a single pilot team on one isolated component; teams that try to adopt all patterns simultaneously experience coordination overhead that negates productivity gains.
Read this if: You need infrastructure requirements, security guidance, and phased adoption strategy for the patterns from Part 1.
Time to read: 15 minutes | Prerequisites: Read Part 1 first
You understand the three-layer framework from Part 1: spec-driven context eliminates drift, evaluation-driven quality catches AI failures, and structured review transfers knowledge. Now you need to actually implement these patterns without disrupting current delivery or creating security holes.
This guide covers the infrastructure requirements, security considerations, and phased adoption strategy that make these patterns work in production environments.
Infrastructure Requirements
Compute and Storage
Evaluation infrastructure is the primary cost driver. Running continuous evaluations requires compute to execute test suites and storage for evaluation datasets and results history.
Compute needs:
- CI runners: Existing GitHub Actions or equivalent handles most evaluation workloads
- Optional: Dedicated evaluation cluster for large-scale runs (100+ test cases per PR)
- Model API access: Budget for evaluation API calls (details in Cost section)
Hypothetical storage needs:
- Evaluation datasets: ~100MB per component (50-100 labeled examples with metadata)
- Results history: ~10GB/year for team of 10 (compressed evaluation logs)
- Specifications: Negligible (<1MB per spec, stored in repository)
Cost estimates (monthly, team of 10 developers):
| Cost Category | Conservative | Expected | Notes |
|---|---|---|---|
| Development Evaluation | $100 | $200 | PR evaluations + local testing |
| Production Monitoring | $50 | $100 | Daily evaluation runs on main branch |
| CI Compute | $0 | $25 | GitHub Actions (free tier covers most teams) |
| Storage | $5 | $5 | Evaluation datasets + results history |
| Total Infrastructure | $155/month | $330/month | Scales with team size and PR volume |
Additional operational costs:
- Spec creation: 20 hours/month @ $75/hour = $1,500/month (developer time, not AI costs)
- Dataset maintenance: 10 hours/quarter @ $75/hour = $250/month amortized
Total monthly cost: $1,905 - $2,080
How to start lean:
- Begin with free GitHub Actions tier
- Use cheaper models for initial evaluations (GPT-3.5-turbo, Claude Haiku)
- Run evaluations only on main branch initially, expand to all PRs once patterns stabilize
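As a concrete starting point, the lean setup above can be wired into CI with a workflow that runs evaluations only on pushes to main. A minimal sketch, assuming a Promptfoo-based suite; the file name, config path, and secret name are illustrative, not a prescribed layout:

```yaml
# .github/workflows/ai-eval.yml (illustrative name and paths)
name: ai-evaluations
on:
  push:
    branches: [main]   # expand to pull_request triggers once patterns stabilize
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: npx promptfoo@latest eval -c tests/evaluation/promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```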
Model Context Protocol (MCP) Integration
MCP enables AI assistants to fetch documentation on-demand, preventing outdated context from causing deprecated patterns or incorrect API usage.
What it is: MCP servers expose data sources (documentation, APIs, databases) to AI tools through standardized interfaces. Instead of training data from 18 months ago, your AI fetches current documentation when generating code.
Example workflow:
- Developer asks AI: “Generate authentication middleware”
- AI queries MCP server: “Get latest passport.js documentation”
- MCP returns current v0.7.0 docs (not outdated v0.5.0 from training data)
- AI generates code using current patterns
Infrastructure requirements:
- Small cloud VM or container (2 vCPU, 4GB RAM)
- Network access to documentation sources
- API key management for authentication
- Monthly hosting cost: ~$20-50
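A sketch of that footprint, assuming a containerized deployment; the image name, upstream URL, port, and variable names below are placeholders for whichever MCP server you actually run:

```yaml
# docker-compose.yml (sketch; image, port, and variable names are hypothetical)
services:
  docs-mcp:
    image: ghcr.io/your-org/docs-mcp-server:1.4.2   # pin an exact version (see Security section)
    environment:
      MCP_API_KEY: ${MCP_API_KEY}                   # injected from a secret manager, never hardcoded
      DOCS_UPSTREAM: https://docs.internal.example.com
    ports:
      - "127.0.0.1:8080:8080"                       # publish on an internal interface only
    cpus: "2"                                       # matches the 2 vCPU / 4GB RAM sizing above
    mem_limit: 4g
    restart: unless-stopped
```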
Setup: For detailed MCP server installation, configuration, and AI tool integration, see the MCP Infrastructure Setup Guide from Anthropic.
Security note: MCP servers access internal documentation. Run them inside your network perimeter, authenticate all requests with API keys, and enable audit logging. Details in Security section below.
Security Considerations
Three Critical Attack Surfaces
AI-assisted development introduces specific security risks beyond traditional code review. Address these before scaling adoption.
1. MCP Supply Chain Risk
MCP servers fetch documentation that AI tools incorporate into code generation. A compromised documentation source becomes an attack vector.
⚠️ Threat Scenario
Attacker injects malicious patterns into documentation fetched by MCP (e.g., “always disable SSL verification for database connections”).
Impact:
- AI incorporates bad practices into generated code
- Vulnerabilities spread across multiple components
- Hard to detect (code looks “normal”)
- Difficult to remediate at scale
🛡️ Mitigations
Prevention:
- Pin MCP server versions and verify checksums on deployment
- Authenticate all documentation sources with mutual TLS
- Run MCP servers in isolated network segments with egress filtering
- Implement content integrity checks (hash documentation responses)
Detection:
- Log all documentation fetches for audit trails
- Alert on unexpected documentation source changes
- Monitor for suspicious patterns in generated code (regex scanning for “disable.*verification”, “skip.*validation”, etc.)
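The regex scanning mentioned above can live in CI. The step below is a sketch to drop into an existing GitHub Actions job; the `src/` path and pattern list are assumptions to adapt to your repository:

```yaml
- name: Scan source for suspicious patterns
  run: |
    # Fail the job if risky idioms appear anywhere in the checked-out sources.
    if grep -rniE 'disable.*verification|skip.*validation|verify *= *false' src/; then
      echo "::error::Suspicious pattern detected; route the change to manual security review"
      exit 1
    fi
```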
2. Secrets in AI Context
AI tools see everything you paste into prompts or include in context. Secrets accidentally included in specifications or code examples leak to model providers.
Common leak paths:
- Pasting database connection strings into specs as “examples”
- Including `.env` files in repository context
- Hardcoded API keys in specification technical constraints
- Credentials in error messages copied into AI chat
Mitigations:
- Pre-commit hooks scanning for secrets (use git-secrets, gitleaks, or trufflehog)
- Lint specifications for credential patterns before committing
- Never commit credentials to version control - use secret managers
- Encrypt sensitive values in specs with references to secret manager keys
- Configure AI tools to exclude sensitive file patterns (`.env`, `secrets.yml`, etc.)
Example pre-commit configuration:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]
        exclude: "(package-lock.json|yarn.lock)"
```
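To create the baseline the hook compares against, run `detect-secrets scan > .secrets.baseline` once (with detect-secrets installed locally) and commit the resulting file; the path matches the `--baseline` argument above.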
3. Unvalidated AI-Generated Code
AI-generated code may contain vulnerabilities that pass traditional static analysis:
- SQL injection in dynamically constructed queries
- XSS vulnerabilities in HTML generation
- Authentication bypass in complex conditional logic
- Race conditions in concurrent code
Why traditional tools miss these: Static analyzers catch syntactic patterns (“this looks like SQL injection”) but miss semantic vulnerabilities where the logic is wrong but syntactically valid.
Mitigations:
- Manual security review for all authentication/authorization code
- Penetration testing of AI-generated API endpoints before production
- Runtime security monitoring (WAF, RASP) to catch exploitation attempts
- Structured code review (Triple R pattern from Part 1) with explicit security checklist
Security checklist for reviewers:
- Authentication: Who can call this? Is it enforced?
- Authorization: What data can they access? Are checks explicit?
- Input validation: Are all inputs sanitized? Any injection vectors?
- Output encoding: Is user-generated content escaped?
- Error handling: Do errors leak sensitive information?
Implementation Sequence
Deploy these patterns incrementally. Teams that try to adopt all three patterns simultaneously experience coordination overhead that negates productivity gains.
Phase 1: Spec-Driven Development (Pilot Team)
Timeline: 2-4 sprints
Goal: Establish specification discipline with one team on one component. Prove that persistent specs reduce rework.
Steps:
1. Select pilot component (Week 1)
   - Choose isolated feature with clear boundaries
   - Must have known requirements volatility (benefits specs most)
   - Avoid critical-path or legacy systems (reduce adoption risk)
2. Create first specifications (Week 1-2)
   - Start with template: Business Intent, Given/When/Then scenarios, Technical Constraints, Acceptance Criteria
   - Store specs in `docs/specs/` alongside code
   - Review specs before implementation (catch requirement gaps early)
3. Iterate on spec format (Week 2-4)
   - Developers give feedback on what’s unclear or missing
   - Refine template based on real usage
   - Document team conventions (e.g., “Technical Constraints must include exact library versions”)
4. Measure rework reduction (Week 4+)
   - Count requirement clarifications per story (before/after specs)
   - Track PRs requiring re-work due to missed requirements
   - If rework drops, expand to more components
Success criteria:
- Specs exist for all new features in pilot component
- Developers reference specs during implementation (not just “checkbox compliance”)
- Measurable reduction in “I didn’t know we needed that” PR feedback
Phase 2: Evaluation-Driven Development (Single Component)
Timeline: 2-3 sprints (can start after Phase 1 Week 2)
Goal: Build evaluation capability for one AI-powered component. Prove that continuous evaluation catches issues before production.
Prerequisites:
- Component with measurable quality criteria (e.g., classification accuracy, extraction precision)
- Baseline labeled dataset (50-100 examples)
- CI infrastructure with model API access
Steps:
1. Create evaluation dataset (Week 1)
   - Label 50-100 examples covering happy path and edge cases
   - Include adversarial inputs (typos, malformed data, boundary conditions)
   - Store in `tests/evaluation/datasets/`
2. Set up evaluation framework (Week 1-2)
   - Choose tool (Promptfoo recommended for getting started; see the configuration sketch after these steps)
   - Define metrics: quality (accuracy/precision), latency, cost
   - Set thresholds aligned with business requirements
3. Integrate into CI (Week 2)
   - Run evaluations on PRs touching AI components
   - Fail builds below quality threshold
   - Report metrics in PR comments for visibility
4. Establish monitoring (Week 3+)
   - Daily evaluation runs on main branch
   - Alert on degradation (quality drops >5%, cost spikes >20%)
   - Weekly review of evaluation trends
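As a starting point for step 2, a minimal Promptfoo configuration might look like the sketch below. The prompt file, test inputs, provider choice, and thresholds are illustrative assumptions, not recommended values:

```yaml
# tests/evaluation/promptfooconfig.yaml (illustrative paths, inputs, and thresholds)
prompts:
  - file://prompts/classify_ticket.txt   # hypothetical prompt using a {{ticket}} variable
providers:
  - openai:gpt-3.5-turbo                 # cheaper model while iterating (see cost notes above)
tests:
  - vars:
      ticket: "My invoice from March is missing"
    assert:
      - type: contains
        value: "billing"                 # expected category label
  - vars:
      ticket: "app crashses when i tap login"   # adversarial input with typos
    assert:
      - type: contains
        value: "bug"
defaultTest:
  assert:
    - type: latency
      threshold: 2000                    # milliseconds per response
    - type: cost
      threshold: 0.002                   # dollars per call
```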
Success criteria:
- Evaluations run automatically on every relevant PR
- At least one issue caught by evaluation before merge
- Team uses metrics to make model/prompt decisions
Phase 3: Structured Code Review (Team-Wide)
Timeline: 1-2 sprints (can start immediately, parallel to other phases)
Goal: Improve review quality and knowledge transfer through Triple R pattern (Request, Rationale, Result).
Steps:
1. Train reviewers (Week 1)
   - Share Triple R pattern examples
   - Explain why generic “LGTM” fails (no knowledge transfer, misses context-specific issues)
   - Practice on historical PRs: rewrite weak comments using Triple R (see the example comment after these steps)
2. Establish review standards (Week 1)
   - Document team expectations (e.g., “all comments must include rationale”)
   - Create review checklist covering architecture, security, maintainability
   - Set response time expectations (reviews within 24 hours)
3. Implement review process (Week 1-2)
   - Require structured comments for approval
   - Use review templates in PR description to prompt thoroughness
   - Track review quality (how often do issues escape to production?)
4. Iterate based on feedback (Week 2+)
   - Developers report which comments were most helpful
   - Refine standards based on what works
   - Measure knowledge transfer (can junior devs explain architectural decisions?)
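For illustration, a hypothetical Triple R comment on an AI-generated PR might read as follows (the helper name is made up):

> Request: extract the retry/backoff logic into the shared httpClient helper instead of duplicating it here.
> Rationale: two other services already wrap retries in that helper; duplicated retry policies drift and make timeout tuning inconsistent.
> Result: one shared retry policy that future AI-generated code can reference, and a smaller diff for the next reviewer.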
Success criteria:
- Reviews contain specific, actionable feedback with clear rationale
- Junior developers report learning from reviews (survey or 1:1s)
- Fewer defects escape to production compared to baseline
Scaling Beyond Pilot Team
Once patterns work for the pilot team, scaling requires coordination and standardization.
Multi-Team Coordination
Challenge: Different teams adopt patterns at different paces. Specs and evaluations become inconsistent across teams.
Solution: Establish shared standards with team autonomy
1. Create pattern playbook
   - Document proven specification templates, evaluation configurations, review checklists
   - Include examples from pilot team
   - Explain the “why” behind each pattern (so teams can adapt to context)
2. Designate pattern champions
   - One person per team responsible for pattern adoption
   - Champions meet regularly to share learnings and refine standards
   - Avoid central enforcement (teams must own their patterns)
3. Share evaluation infrastructure
   - Central evaluation cluster reduces per-team setup cost
   - Shared dataset storage with team namespaces
   - Common metrics dashboards for cross-team visibility
4. Avoid rigid standardization
   - Teams working on different product areas need different patterns
   - Standardize principles (specs must have acceptance criteria), not specifics (exact template format)
   - Allow experimentation and share results
Managing Organizational Change
AI development patterns require cultural shifts that most teams underestimate. Technical implementation is straightforward; getting people to change behavior is hard.
Executive Sponsorship Requirements
Why you need it:
- Pattern adoption requires upfront investment ($15K+ implementation costs) before benefits materialize
- Teams will resist “more documentation” and “slower CI” during the uncomfortable early phase
- Cross-team coordination (shared evaluation infrastructure, spec standards) requires authority
What sponsors must do:
- Defend budget - Infrastructure costs ($155-330/month) and implementation time (200+ hours)
- Enforce compliance - Reject PRs without specs during pilot phase to establish new norms
- Communicate strategic rationale - Explain why short-term friction enables long-term velocity
- Shield pilot teams - Protect them from pressure to “just ship faster” during the learning period
Red flag: If your sponsor isn’t willing to defend a 2-4 sprint pilot with reduced velocity, patterns will fail due to premature abandonment.
Realistic Adoption Timeline
Month 1-2: Pilot team, expect friction
- Developers learn new workflows (specs-first, evaluation-driven)
- Productivity drops 10-20% as team builds new habits
- Many false starts: specs too detailed, evaluations too strict, reviews too nitpicky
- Key milestone: First feature ships with complete spec → eval → review cycle
Month 3-4: Refine patterns, document learnings
- Pilot team identifies what works (keep) vs what doesn’t (discard)
- Create pattern playbook with examples from real work
- Productivity returns to baseline as workflows become muscle memory
- Key milestone: Team can articulate “why” behind each pattern
Month 5-6: Expand to 2-3 teams
- Early adopter teams join, bring fresh perspectives
- Pattern champions meet weekly to share challenges
- Infrastructure scales (more API quota, shared eval cluster)
- Key milestone: Multiple teams using patterns without constant support
Month 7-12: Widespread adoption, benefits compound
- New team members onboard into established patterns
- Metrics show measurable improvements (reduced rework, faster reviews)
- Patterns become “how we work” rather than “that new thing”
- Key milestone: Patterns survive leadership changes and team turnover
Warning: Don’t expect ROI in the first 6 months. Early stages are investment phase - costs without proportional benefits. Conservative ROI scenario (5% return, 20-month payback) assumes this reality.
Managing Resistance
Common objections and detailed responses:
"Specs are just more documentation that gets outdated"
Developers have seen too many spec documents that were written once, never maintained, and became misleading rather than helpful.
- Specs are executable contracts, not traditional documentation. They drive code generation and evaluation - if they’re wrong, tests fail immediately
- When requirements change, update spec first before regenerating code. This keeps spec and code synchronized by design
- Unlike docs, specs have forcing functions: AI generates from them, evaluations test against them, reviewers reference them
- Start with high-value features where requirements volatility makes specs most valuable
- Show side-by-side: scattered Slack messages vs single spec as source of truth
- Measure: count “I thought we agreed to…” conversations before/after specs
- Commit: spec updates in same PR as code changes (not separate doc repo)
"Evaluations slow down development"
CI pipelines already take 10-15 minutes. Adding evaluation runs makes them even slower, and developers hate waiting.
- Fast CI failures (2 minutes for evaluation) prevent slow production failures (2+ hours debugging + incident response + customer impact)
- Run evaluations only on AI-touched files (not every PR). Use path filters such as `paths: ['src/ai/**', 'prompts/**']` (see the sketch after this list)
- Parallel execution: evaluations run concurrently with unit tests, no added wall-clock time
- Measure actual CI time increase (typically <2 minutes)
- Track production incidents before/after evaluations
- Show cumulative time saved from prevented bugs
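A minimal sketch of that path filter as a GitHub Actions trigger (paths mirror the filter above; the evaluation job itself is omitted):

```yaml
on:
  pull_request:
    paths:
      - 'src/ai/**'
      - 'prompts/**'
```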
Idealized cost comparison:
Without evaluations:
- 10 PRs/week ship
- 2 PRs/week have AI issues
- 4 hours debugging each = 8 hours/week wasted
With evaluations:
- 10 PRs/week ship
- 0.5 PRs/week have AI issues (75% reduction)
- 2 minutes/PR for evaluation = 20 minutes/week invested
- 6 hours/week saved = 18x ROI on time invested

"We already do code review"
Teams feel they're already doing thorough reviews. Structured review feels like unnecessary process overhead and micromanagement.
- Current review quality varies wildly by reviewer. Generic “LGTM” approvals are common but provide zero knowledge transfer or quality improvement
- Structured review (Triple R) takes same time but produces measurable outcomes: junior devs learn patterns, architectural decisions are documented
- Without structure, AI-generated code gets rubber-stamped because reviewers can’t quickly assess 500-line PRs
- Audit last 20 PRs: how many reviews explain why changes are needed?
- Interview junior devs: “Do reviews help you learn architecture?”
- Compare review quality between structured vs unstructured reviewers
Data from teams that adopted Triple R:
- Review time: unchanged (~15 min/PR)
- Defects caught: +40%
- Junior dev learning velocity: 2x faster (qualitative assessment via 1:1s)
- Knowledge silos: reduced (multiple people can maintain each component)
"This is too much process for our small team"
Startups and small teams prioritize speed over process. Patterns feel like enterprise bureaucracy.
- Small teams have less margin for error. A single production incident costs proportionally more of your sprint capacity
- Patterns prevent the technical debt that kills velocity as codebases grow
- Start minimal: specs for unclear features only, evaluations for critical AI paths only, structured review for complex logic only
- You can always add process; removing embedded bad patterns is exponentially harder
When to skip patterns:
- Team <3 people (communication overhead exceeds benefit)
- True throwaway prototypes (will rewrite, not maintain)
- Requirements are crystal clear (though rare in practice)
"Our AI code is too simple to need evaluations"
Team is just using AI for basic string formatting or simple transformations. Seems overkill to evaluate 'obvious' correctness.
- Model providers update models without warning. Your “simple” logic breaks when GPT-4 → GPT-4 turbo changes output format
- Edge cases compound. “Simple” email extraction fails on internationalized domains, multiple recipients, embedded emails in quoted text
- Evaluations are insurance. Cost is small ($100/month), payout is large (avoiding production incidents)
- Ask: “What happens if OpenAI updates GPT-4 and changes output format?”
- Show: historical examples of model updates breaking production systems
- Calculate: cost of one production incident vs annual evaluation costs
Start minimal:
- 10 test cases covering happy path + obvious edge cases
- Run on main branch daily (not every PR)
- Expand only if issues are found
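The daily main-branch run from the list above fits a scheduled GitHub Actions trigger; a minimal sketch (the cron time is arbitrary, and scheduled workflows run against the default branch):

```yaml
on:
  schedule:
    - cron: '0 6 * * *'   # once a day, against the default branch
```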
Building Pattern Champions Network
Structure:
- One champion per team (typically senior/staff engineer)
- Weekly 30-minute sync to share learnings
- Shared Slack channel for real-time questions
- Monthly playbook updates with new examples
Champion responsibilities:
- Help teammates write specs and evaluations
- Review PRs for pattern compliance
- Escalate blockers (tooling issues, resource constraints)
- Contribute team-specific examples to playbook
Champion incentives:
- Recognized as technical leaders driving quality
- Input into pattern evolution (not top-down mandates)
- Visibility to senior leadership during monthlies
- Career development: mentorship, thought leadership
Warning Signs of Failing Adoption
Watch for these signals that patterns aren’t taking hold:
- Specs written after code - Indicates team sees specs as compliance checkbox, not design tool
- Evaluation suites stagnate - Dataset not updated as production inputs evolve
- Review comments lack rationale - Triple R format not actually used
- Opt-out pressure - Teams asking “can we skip patterns for this one feature?”
- Champion burnout - Single person doing all spec writing/evaluation creation
Recovery actions:
- Re-emphasize executive sponsorship and “why”
- Pair programming sessions to rebuild muscle memory
- Simplify patterns if they’re too heavyweight for your context
- Consider: patterns may not fit your team’s needs - that’s valid too
Tool Stack Overview
You don’t need specialized tools to implement these patterns, but they reduce setup friction.
Specification Management
Git + Markdown (free)
- Store specs alongside code in `docs/specs/`
- Version control ensures specs stay synchronized with implementation
- No additional tooling required
Notion/Confluence (if already using for documentation)
- Embed specs in existing team wiki
- Better discoverability for non-developers
- Export to Markdown for AI consumption
Evaluation Frameworks
Promptfoo (open-source)
- Battle-tested evaluation framework supporting multiple model providers
- Built-in CI integration
- Strong community and documentation
- Start here unless you have specific needs
Braintrust (commercial)
- Production-grade evaluation platform with advanced analytics
- Better for teams running large-scale evaluations (1000+ test cases)
- Includes drift detection and alerting
LangSmith (commercial, LangChain ecosystem)
- Integrated with LangChain development workflows
- Good for teams already using LangChain
- Strong tracing and debugging capabilities
Code Review Tools
GitHub/GitLab built-in review (included)
- Sufficient for structured review using Triple R pattern
- No additional tooling needed
AI review assistants (optional, reduces manual effort):
- Qodo: AI reviewer with codebase context (RAG-based)
- CodeRabbit: Automated review suggestions on every PR
- Greptile: Natural language codebase search for reviewers
These tools don’t replace human review - they pre-filter trivial issues (formatting, simple bugs) so reviewers focus on architecture and business logic.
What’s Next
You now have the infrastructure, security, and adoption strategy to implement AI development patterns in production.
Ready for production operations?
- Part 3: Production Operations covers observability, incident response, and measuring ROI once these patterns are running at scale
Glossary
- Mutual TLS (mTLS): A security protocol where both client and server authenticate each other using certificates. More secure than one-way TLS because it prevents unauthorized clients from accessing services.
- Egress filtering: Network security controls that restrict which external destinations internal services can connect to. Prevents compromised services from exfiltrating data or connecting to attacker-controlled servers.
- Pre-commit hooks: Scripts that run automatically before git commits, checking for issues like secrets, linting violations, or test failures. Prevents problematic code from entering version control.
- CI/CD: Automated pipelines that test, build, and deploy code changes. CI runs tests on every commit; CD automatically deploys passing changes to production.
- Attack surface: The sum of all points where an unauthorized user could try to enter or extract data from a system. Larger attack surfaces create more security risk.
- Retrieval-Augmented Generation (RAG): An AI architecture that fetches relevant documents before generating responses, combining retrieval systems with language models for more accurate, grounded outputs.
- Vector database: A specialized database for storing and querying high-dimensional vectors (embeddings). Used in AI systems to find semantically similar content.
References
- , "Spec-Driven Development: Using Markdown as a Programming Language" , September 30, 2025. https://github.blog/ai-and-ml/generative-ai/spec-driven-development-using-markdown-as-a-programming-language-when-building-with-ai/
- , "Model Context Protocol Documentation" , 2025. https://modelcontextprotocol.io/
- , "AI Security and Privacy Guide" , 2024. https://owasp.org/www-project-ai-security-and-privacy-guide/
- , "LLM Evaluation Framework Documentation" , 2025. https://www.promptfoo.dev/docs/intro
- , "Securing AI-Generated Code in DevOps Pipelines" , July 15, 2025. https://developers.redhat.com/articles/2025/07/15/securing-ai-generated-code