This is a companion to 8 Patterns for Spec-Driven Development. While that article covers practical workflow patterns, this one dives into the ML-informed techniques that explain why certain approaches work and how to optimize them.
⚠️ Research Note (Nov 2025): Much of the foundational research cited here (attention mechanisms, self-consistency, RAG) was conducted on 2023-2024 models (GPT-3.5-turbo, GPT-4-0125). The newest models (Claude Opus 4.5, GPT-5.1) were released in late 2025 and may exhibit different behaviors. The core principles remain valid, but specific numbers and thresholds should be validated for your workload.
Getting Started: Prioritize by Impact
Before diving into the details, here’s how to sequence these patterns for maximum impact:
Free Optimizations (Immediate Impact)
Start here—no cost increase, immediate results:
- Attention-aware prompt ordering (Part 1): Move critical constraints to beginning AND end of prompts
- Checkpointing (Part 1): Add state persistence to long-running tasks
- Positive instruction framing (Part 1): Replace negative prompts with specific alternatives
Cost Reduction (50-90% Savings)
Once free optimizations are stable:
- Prompt caching (Part 2): Cache static context (system prompts, docs, tool definitions)
- Model routing (Part 2): Use cheaper models for routine tasks (15-25x cost reduction on 60%+ of tasks)
Quality Improvements (Measured Gains)
After you have baseline metrics:
- Evaluation suite (Part 3): Build 10-100 test cases with deterministic grading where possible
- Self-consistency (Part 3): Apply to critical decisions only (3-5x cost, +17% accuracy on reasoning tasks)
- RAG pipeline (Part 4): If knowledge base exceeds 200K tokens or requires frequent updates
Don’t Implement Everything
These are options, not requirements. Profile your workload, measure impact, prioritize accordingly.
Operational Complexity vs Benefit
Before implementing, consider maintenance burden:
| Pattern | Maintenance | Benefit | When to Use |
|---|---|---|---|
| Prompt ordering | None | High | Always (free) |
| Caching (explicit) | Low | High | Repeated calls >2-3x |
| Caching (automatic) | None | Medium | OpenAI users (automatic) |
| Model routing | Medium | High | >1K requests/day with mixed workload |
| Self-consistency | Medium | Medium | Critical accuracy needs only |
| RAG pipeline | High | High | Large/dynamic knowledge base |
| Circuit breakers | Medium | Medium | Production reliability critical |
Start with high benefit, low complexity patterns. Measure impact before adding operational burden.
Part 1: Free Optimizations
Start here. These patterns require no infrastructure, no cost increase, and provide immediate value.
Attention-Aware Prompt Ordering
The “Lost in the Middle” Problem
The Research: In 2023, researchers at Stanford published “Lost in the Middle: How Language Models Use Long Contexts” (Liu et al., TACL 2024). Their findings fundamentally changed how we should structure prompts.
The Core Finding: LLM performance follows a U-shaped curve based on information position. Models perform best when relevant information appears at the beginning (primacy) or end (recency) of context, with significant degradation for middle-positioned information.
Concrete Numbers (from GPT-3.5-turbo-16k testing):
- Nearly 20% accuracy drop when handling 30 documents vs 5
- When key information was in the middle, performance dropped below closed-book baseline—worse than having no documents at all
- The effect intensifies with context size
Note: This research tested GPT-3.5-turbo (2023). Newer models (GPT-4o, Claude Sonnet 4.5) handle long contexts better but still exhibit U-shaped attention patterns. The principle remains: place critical information at beginning and end.
Why Position Matters
Transformers use Attention(Q, K, V) = softmax(QK^T / √d_k) V. The softmax creates zero-sum competition—attention weights must sum to 1 across all positions. As context grows, each token competes with more tokens, diluting attention.
Additionally, positional encodings (RoPE) make distant tokens less similar, and causal masking biases models toward early tokens. Result: U-shaped attention curve.
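The dilution effect can be illustrated with a toy calculation. This is a sketch under a strong simplifying assumption, not a model of any real transformer: one "relevant" token scores a fixed margin above every other token, and all other tokens share the same score. The names `attention_to_relevant` and `margin` are illustrative.

```python
import math

def attention_to_relevant(context_len: int, margin: float = 2.0) -> float:
    """Softmax weight on one token whose score is `margin` above the rest.

    Toy assumption: every other token has the same score (0.0).
    """
    relevant = math.exp(margin)
    others = (context_len - 1) * math.exp(0.0)
    return relevant / (relevant + others)

# The weight on the relevant token shrinks as the context grows,
# because the softmax weights must still sum to 1.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} tokens -> weight on relevant token: {attention_to_relevant(n):.4f}")
```

Real attention scores are learned and vary per head and layer, so treat this only as intuition for why a fixed score advantage buys less attention in a longer context.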
[Interactive figure: Attention Mechanism: From Formula to "Lost in the Middle". It shows how Attention(Q, K, V) = softmax(QK^T / √d_k) V produces the U-shaped attention pattern. Each token shows the average attention received across all query positions, demonstrating why information at the start and end is processed better than the middle.]
Practical Application: Attention-Aware Prompt Structure
Structure your prompts to leverage attention patterns:
| Position | Attention | Ideal type of information |
|---|---|---|
| Beginning | High | Critical instructions, system prompt, architecture decisions |
| Middle | Lower | Supporting context, documentation, examples, history |
| End | High | Current task, key constraints, “Remember: use Svelte 5 runes only” |
OpenAI’s official guidance (GPT-4.1 Prompting Guide):
“If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context.”
Key technique: Repeat critical constraints at the END of your prompt, not just the beginning. Common mistake: putting all critical instructions in system prompt only. Models don’t reliably prioritize system over user prompts (Instruction Hierarchy research)—recency bias makes final instructions most influential.
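One way to apply the beginning-and-end rule is to assemble prompts with a small helper. The sketch below is a minimal illustration; the function name `build_prompt`, the section headings, and the placeholder context strings are assumptions, not part of any provider's API.

```python
def build_prompt(critical: list[str], context: list[str], task: str) -> str:
    """Assemble a prompt with critical constraints at both ends."""
    rules = "\n".join(f"- {rule}" for rule in critical)
    parts = [
        "## Critical instructions\n" + rules,               # beginning: high attention
        "## Reference context\n" + "\n\n".join(context),    # middle: lower attention
        "## Current task\n" + task,                         # near the end
        "## Remember\n" + rules,                            # end: repeat the constraints
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    critical=["Use Svelte 5 runes only"],
    context=["<long documentation excerpt>", "<conversation history summary>"],
    task="Convert the Counter component to runes.",
)
```

Keeping the constraints in one list and rendering them twice avoids the two copies drifting apart as the prompt evolves.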
Context Budget: Less Can Be More
More context ≠ better results. Empirical thresholds from Databricks research (Nov 2024): Llama-3.1-405B degrades after 32K tokens, GPT-4-0125 after 64K. Even perfect retrieval with irrelevant padding causes 24% accuracy drop.
Priority-based context allocation: Critical instructions (beginning + end) → Current task (near end) → Reference docs (middle) → Historical context (summarize or omit).
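The priority order above can be turned into a simple budget filter. This is a rough sketch: `allocate_context` is a hypothetical helper, and token counts are approximated by whitespace-separated words, which a real system would replace with the model's tokenizer.

```python
def allocate_context(items: list[tuple[int, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-priority items that fit in the budget.

    `items` holds (priority, text) pairs; a lower number means higher priority.
    Token counts are approximated by whitespace-separated words.
    """
    kept, used = [], 0
    for priority, text in sorted(items, key=lambda item: item[0]):
        cost = len(text.split())
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept

selected = allocate_context(
    [
        (0, "Critical: use Svelte 5 runes only"),
        (1, "Task: migrate the Counter component"),
        (2, "Reference: " + "doc " * 50),
        (3, "History: " + "old " * 500),
    ],
    budget_tokens=100,
)
```

Here the oversized history item is dropped while the three higher-priority items survive. Selection only decides what goes in; where each item is placed in the prompt still follows the position guidance above.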
Alternative to context stuffing: Pinecone research found RAG preserved 95% accuracy using only 25% of tokens—75% cost reduction with minimal quality loss.
Checkpointing Long-Running Tasks
The Problem
AI tasks fail. Networks time out, rate limits hit, processes crash. Without checkpointing, you restart from scratch.
The Solution
Design as state machines with frequent checkpoints:
import json

class RefactoringWorkflow:
    def __init__(self, checkpoint_path: str):
        self.checkpoint_path = checkpoint_path
        # _load_or_init, _analyze, and _transform are omitted here
        self.state = self._load_or_init()

    def checkpoint(self):
        # Persist everything needed to resume after a crash
        with open(self.checkpoint_path, 'w') as f:
            json.dump({
                "phase": self.state["phase"],
                "completed_files": self.state["completed_files"],
                "pending_files": self.state["pending_files"],
            }, f)

    async def run(self):
        if self.state["phase"] == "analysis":
            await self._analyze()
            self.state["phase"] = "transform"
            self.checkpoint()
        if self.state["phase"] == "transform":
            # Iterate over a copy because pending_files shrinks as we go
            for file in list(self.state["pending_files"]):
                await self._transform(file)
                self.state["pending_files"].remove(file)
                self.state["completed_files"].append(file)
                self.checkpoint()  # After EACH file
Why It Works
Checkpointing after each logical unit means failures cost seconds, not hours. Resume from the last successful state.
Positive Instruction Framing
The Problem
Negative instructions (“don’t do X”) are weaker than positive alternatives (“do Y instead”).
The Solution
Positive vs negative framing:
| Negative (Weaker) | Positive (Stronger) |
|---|---|
| “Don’t write too much detail” | “Provide a concise 2-3 sentence summary” |
| “Avoid technical jargon” | “Use clear, simple language” |
| “Never include opinions” | “Stick to factual statements only” |
When negative prompting IS effective: Hard safety constraints with explicit alternatives:
## FORBIDDEN (will cause build failures)
- `onMount` — use `$effect()` instead
- `$:` reactive declarations — use `$derived()` instead
If you find yourself about to write `onMount`, STOP.
This is Svelte 5. The correct pattern is `$effect()`.
Research finding (Instruction Hierarchy paper, April 2024): Models don’t reliably prioritize system prompts over user prompts, especially when social cues are present.
Practical implication: Don’t rely solely on system prompt hierarchy. Reinforce critical constraints in user messages with positive framing.
Part 2: Cost Optimization
Once free optimizations are stable, tackle cost. These patterns can reduce API expenses by 50-90%.
Prompt Caching: 50-90% Savings on Repeat Calls
Both major providers offer prompt caching. Key insight: Structure prompts with static content first, variable content last.
Provider Comparison
| Feature | Anthropic Claude | OpenAI GPT |
|---|---|---|
| Activation | Explicit (mark cacheable blocks) | Automatic |
| Cache hit | 90% discount | 50% discount |
| Cache write | 1.25x (5-min TTL) or 2x (1-hr) | No premium |
| Min tokens | 1,024-4,096 | 1,024 |
| TTL | 5 min or 1 hr | 5-10 min |
| Break-even | 2-3 hits (due to write premium) | Immediate (no write premium) |
| Invalidates | Any content change, tool changes | Any prefix change |
Structure for maximum hits (both providers):
# Static content FIRST (cacheable)
system_prompt + claude_md + tool_definitions + static_context
# Variable content LAST (not cached)
current_task + recent_messages
When it pays off: Anthropic’s 90% discount requires 2-3 repeat calls to offset write premium; OpenAI’s 50% discount is immediate. Both deliver significant savings when you have repeated calls with static context.
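A back-of-envelope check of the break-even point, using only the multipliers from the provider table. The helper `break_even_hits` is illustrative and rests on idealized assumptions: the entire prompt is the cached prefix, and every follow-up call arrives before the cache expires.

```python
def break_even_hits(write_multiplier: float, hit_multiplier: float, max_hits: int = 100) -> int:
    """Smallest number of cache hits at which caching is cheaper than not caching.

    Costs are in units of "one uncached pass over the cached prefix".
    """
    for hits in range(1, max_hits + 1):
        cached = write_multiplier + hits * hit_multiplier  # one write, then discounted reads
        uncached = 1 + hits                                # every call pays full price
        if cached < uncached:
            return hits
    raise ValueError("caching never breaks even within max_hits")

five_minute = break_even_hits(write_multiplier=1.25, hit_multiplier=0.10)  # 90% discount
one_hour = break_even_hits(write_multiplier=2.0, hit_multiplier=0.10)
automatic = break_even_hits(write_multiplier=1.0, hit_multiplier=0.50)     # 50% discount
```

Under these idealized inputs the break-even comes after one hit for the 5-minute tier and two hits for the 1-hour tier. Cache expiry between calls, or a large uncached suffix, pushes the real figure higher, so measure hit rates on your own traffic.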
Model Routing: Use Expensive Models Only When Needed
Not every task needs your most powerful model. RouteLLM research (LMSYS/UC Berkeley, July 2024) demonstrated:
- 85% cost reduction with only 5% quality loss
- Matrix factorization router achieved 95% of GPT-4 performance using only 26% GPT-4 calls
Task-to-Model Mapping
| Task Type | Model Tier | Why |
|---|---|---|
| Complex reasoning/architecture | Opus 4.5 / GPT-5.1 | Requires deep analysis |
| Standard implementation | Sonnet 4.5 / GPT-4o | Good enough, faster |
| Simple transforms/formatting | Haiku 4.5 / 4o-mini | Roughly 5-30x cheaper than the top tier |
| Classification/embeddings | Specialized | Purpose-built |
Current Pricing Comparison (Nov 2025)
| Model | Input/MTok | Output/MTok | Released |
|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | Nov 2025 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Sep 2025 |
| Claude Haiku 4.5 | $1.00 | $5.00 | Oct 2025 |
| GPT-5.1 | $1.25 | $10.00 | Late 2025 |
| GPT-4o | $2.50 | $10.00 | May 2024 |
| GPT-4o-mini | $0.15 | $0.60 | Jul 2024 |
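By these list prices, GPT-4o-mini is roughly 8x cheaper on input than GPT-5.1 (about 17x on output) and about 33x cheaper on input than Claude Opus 4.5. Within the Claude family, Haiku 4.5 is 5x cheaper than Opus 4.5 on both input and output.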
Implementation Patterns
- Classifier-based: Train small model to predict which tasks need expensive models
- Heuristic: Route based on query length, detected task type, keywords
- Cascading: Start cheap, escalate on low confidence
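The heuristic and cascading patterns can be sketched in a few lines. Everything here is an assumption for illustration: the keyword list, the length thresholds, the tier names, and the convention that each model callable returns an `(answer, confidence)` pair. How you obtain a confidence score depends on your stack.

```python
from typing import Callable

REASONING_KEYWORDS = ("architecture", "design", "trade-off", "debug", "why")

def route_by_heuristic(task: str) -> str:
    """Pick a model tier from query length and keywords (thresholds are guesses)."""
    text = task.lower()
    n_words = len(text.split())
    if any(word in text for word in REASONING_KEYWORDS) or n_words > 200:
        return "large"
    if n_words > 30:
        return "medium"
    return "small"

def cascade(
    task: str,
    tiers: list[Callable[[str], tuple[str, float]]],
    min_confidence: float = 0.8,
) -> str:
    """Try cheap tiers first; escalate while reported confidence is too low."""
    answer = ""
    for call_model in tiers:
        answer, confidence = call_model(task)
        if confidence >= min_confidence:
            break
    return answer  # falls back to the last tier's answer if none was confident

# Stub models for illustration only
cheap = lambda task: ("draft", 0.4)
strong = lambda task: ("final", 0.95)
result = cascade("Reformat this JSON", [cheap, strong])
```

Cascading pays for the cheap call even when it escalates, so it helps most when the cheap tier succeeds on a large share of traffic.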
Part 3: Quality Assurance
After establishing baseline metrics, layer in quality improvements. These cost more but deliver measurable gains.
Self-Consistency: Multiple Samples, Majority Vote
Original research: “Self-Consistency Improves Chain of Thought Reasoning” (Wang et al., Google Research, ICLR 2023)
How It Works
- Sample multiple reasoning paths (n=5-40) with temperature >0
- Extract final answer from each path
- Return the most common answer (majority vote)
Performance Improvements
| Benchmark | Improvement |
|---|---|
| GSM8K (Grade School Math 8K) | +17.9% |
| SVAMP (Math Word Problems) | +11.0% |
| AQuA (Arithmetic Reasoning) | +12.2% |
| StrategyQA (Commonsense Reasoning) | +6.4% |
When It’s Worth the 3-5x Cost
- Complex multi-step reasoning
- High accuracy is critical
- Problem has verifiable correct answer
- Baseline accuracy is 40-70% (diminishing returns at extremes)
Implementation
from collections import Counter

def self_consistency(prompt: str, n_samples: int = 5, temperature: float = 0.7):
    # llm and extract_answer are placeholders for your client and answer parser
    responses = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=temperature)
        answer = extract_answer(response)
        responses.append(answer)
    # Majority vote over the extracted final answers
    return Counter(responses).most_common(1)[0][0]
Cost-benefit: Self-consistency costs 3-5x but improves accuracy by 11-18% on reasoning benchmarks. Use only for high-stakes decisions where accuracy justifies cost. For routine tasks, single-pass with temperature=0 is sufficient.
LLM-as-Judge: Automated Evaluation
Using one model to evaluate another’s output. Research shows:
- Prometheus model: 0.897 Pearson correlation with human evaluators
- Best models: 0.723-0.803 intra-judge consistency (McDonald’s omega)
Known Biases to Account For
- Position bias: Prefers responses in certain positions
- Verbosity bias: Longer = rated higher regardless of quality
- Self-enhancement: Models rate their own outputs higher
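Position bias in pairwise judging is commonly mitigated by asking the judge twice with the order swapped and accepting only consistent verdicts. The sketch below assumes a hypothetical `judge` callable that returns `"first"` or `"second"`; the stub judges exist only to show the behavior.

```python
from typing import Callable

# (question, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; accept only consistent verdicts."""
    forward = judge(question, a, b)    # a shown first
    backward = judge(question, b, a)   # b shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive

# Stub judges for illustration only
always_first = lambda q, x, y: "first"
prefers_shorter = lambda q, x, y: "first" if len(x) <= len(y) else "second"
```

This doubles judge cost and does nothing for verbosity or self-enhancement bias, which need their own controls (for example length-aware rubrics, or a judge from a different model family).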
Rubric Design Best Practices
evaluate:
  criteria:
    - name: "Correctness"
      description: "Does the answer solve the problem?"
      scale: "0-4 points"
      examples:
        4: "Completely correct, handles edge cases"
        2: "Mostly correct, minor issues"
        0: "Fundamentally wrong"
    - name: "Clarity"
      description: "Is the explanation understandable?"
      scale: "0-3 points"
Tools: Promptfoo (open source, multi-provider), Braintrust (platform), OpenAI Evals, Anthropic Console
Evaluation-Driven Development
Treat prompts like code—with tests.
Three Evaluation Types
- Deterministic: String matching, regex, JSON schema validation. Fast, reliable. Use when possible.
- Semantic: Embedding-based similarity. Good for paraphrase detection.
- LLM-as-Judge: Model evaluates against rubric. Handles nuance, but slower/costlier.
Anthropic’s recommendation (cookbook):
“Deterministic grading is by far the best method if you can design an eval that allows for it.”
Building Eval Suites
- Start with 10-100 production examples (quality over quantity early on)
- Include: known good results, edge cases, failure scenarios
- Four components per test: input, expected output, actual output, score
- Run on every prompt iteration
- Set quantitative thresholds for “breaking changes” (e.g., <5% accuracy regression)
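A minimal deterministic eval harness along these lines might look as follows. It is a sketch, not a replacement for the tools listed earlier: `Case`, the grader names, and the `run_prompt` callable (standing in for your model call) are all assumptions.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str
    expected: str

def grade_exact(actual: str, expected: str) -> bool:
    return actual.strip() == expected.strip()

def grade_regex(actual: str, pattern: str) -> bool:
    return re.search(pattern, actual) is not None

def grade_json_keys(actual: str, required: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing all required keys."""
    try:
        data = json.loads(actual)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)

def accuracy(cases: list[Case], run_prompt: Callable[[str], str]) -> float:
    passed = sum(grade_exact(run_prompt(c.input), c.expected) for c in cases)
    return passed / len(cases)

def is_breaking(baseline: float, candidate: float, max_regression: float = 0.05) -> bool:
    """Flag a prompt change whose accuracy drops by more than the threshold."""
    return baseline - candidate > max_regression

# Stub "model" for illustration: upper-cases its input
cases = [Case("a", "A"), Case("b", "B"), Case("c", "x")]
score = accuracy(cases, run_prompt=lambda text: text.upper())
```

Running `accuracy` on every prompt iteration and gating on `is_breaking` gives prompts the same regression safety net that unit tests give code.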
Part 4: Retrieval-Augmented Generation
RAG is powerful but complex. Only implement when the problem justifies the infrastructure.
RAG vs Full Context: Decision Matrix
Don’t build RAG just because it’s popular. Choose based on your actual constraints:
| Your Situation | Recommended Approach | Why |
|---|---|---|
| KB < 100K tokens | Full context + prompt caching | Simpler, no retrieval complexity or infrastructure |
| KB 100-200K tokens, static | Full context | Most models handle this well |
| KB > 200K tokens OR dynamic updates | RAG | Beyond context limits or needs fresh data |
| Compartmentalized data (per-customer) | RAG | Retrieve only relevant partition |
| Need cross-document reasoning | Full context | RAG fragments information, hurts synthesis |
| Updates rare (monthly/quarterly) | Full context, regenerate | No infrastructure overhead |
| Need atomic transactions | Full context | RAG introduces eventual consistency |
| Team lacks ML expertise | Full context → migrate later | Operational burden too high initially |
| Latency-critical (<200ms) | Full context | RAG adds 200-500ms retrieval overhead |
| Cost-sensitive, large KB | RAG | 75% cheaper with 95% accuracy (Pinecone research) |
RAG infrastructure costs: Embedding pipeline maintenance + vector DB ($0.09-0.10/GB/month) + increased debugging complexity. Only worthwhile when KB size or freshness requirements justify it.
Chunking Strategies
How you split documents determines retrieval quality.
For prose:
| Strategy | Best For | Size |
|---|---|---|
| Fixed Token | Simple docs | 256-512 |
| Recursive Character | General | 400-512 |
| Semantic | Topic changes | Variable |
For code:
- Function/Class Level: Natural boundaries
- AST-Based: Respects syntax structure
- Hybrid: Function + docstring as unit
Chroma research (2024): RecursiveCharacterTextSplitter with 400-512 tokens delivered 85-90% recall in empirical testing.
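To make the recursive idea concrete, here is a simplified splitter. It is not LangChain's `RecursiveCharacterTextSplitter` implementation: it measures size in characters rather than tokens, has no overlap, and the function name, separator list, and default size are assumptions.

```python
def recursive_split(
    text: str,
    max_chars: int = 2000,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard cuts
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbors into one chunk
        else:
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_chars:
                current = piece
            else:
                # Piece is still too big: retry with finer separators
                chunks.extend(recursive_split(piece, max_chars, tuple(rest)))
    if current:
        chunks.append(current)
    return chunks

sample = "para one.\n\npara two is here.\n\n" + "word " * 50
chunks = recursive_split(sample, max_chars=40)
```

Paragraph boundaries are preserved where they fit, and only the oversized final paragraph is broken at word boundaries.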
Hybrid Search: Semantic + Keyword
Dense retrieval (embeddings) finds semantic matches. Sparse retrieval (BM25) finds exact terms. Combine them:
score = α × dense_score + (1-α) × sparse_score
Anthropic’s Contextual Retrieval finding (2024 announcement): Combining contextual embeddings with contextual BM25 reduced retrieval failures by 49% versus embeddings alone.
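The weighted-sum formula above only makes sense when the two scores share a scale, since cosine similarities and BM25 scores have different ranges. The sketch below adds min-max normalization as one possible way to handle that; this normalization step, the function names, and the sample scores are assumptions rather than part of the cited work.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """score = alpha * dense + (1 - alpha) * sparse, on normalized scores."""
    d, s = min_max(dense), min_max(sparse)
    docs = set(d) | set(s)
    combined = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)  # missing = 0
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

ranked = hybrid_rank(
    dense={"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40},
    sparse={"doc_b": 12.1, "doc_c": 3.3, "doc_d": 9.0},
    alpha=0.6,
)
```

In this example `doc_b` ranks first because it scores well on both signals, ahead of `doc_a`, which only the dense retriever found. Reciprocal rank fusion is a common alternative when score scales are hard to normalize.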
Re-ranking for Precision
Initial retrieval optimizes for recall (don’t miss anything). Re-ranking optimizes for precision (surface the best).
Two-stage pipeline:
- Fast retrieval: top-K candidates (K=20-100) using embeddings/BM25
- Cross-encoder re-ranking: final top-N (N=3-10) using pairwise comparison
Models: Cohere Rerank 3.5 (commercial, ~$0.002/1K docs), cross-encoder/ms-marco-MiniLM (open source)
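The two-stage shape can be expressed independently of any particular retriever or re-ranker. In this sketch the scoring functions are injected as callables; `word_overlap` and `phrase_match` are deliberately crude stand-ins for embeddings/BM25 and a cross-encoder, and all names are illustrative.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score

def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    fast_score: Scorer,
    precise_score: Scorer,
    k: int = 50,
    n: int = 5,
) -> list[str]:
    """Stage 1 keeps the top-k by a cheap score; stage 2 re-orders them with a costly one."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:k]
    return sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)[:n]

# Stand-in scorers for illustration only
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = ["cache prompt tokens", "the prompt cache saves tokens", "unrelated text"]
top = retrieve_then_rerank("prompt cache", corpus, word_overlap, phrase_match, k=2, n=1)
```

The expensive scorer runs on only `k` documents instead of the whole corpus, which is what makes cross-encoder quality affordable.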
The Seven RAG Failure Points
From “Seven Failure Points When Engineering a RAG System” (Barnett et al., January 2024):
| Failure | Description | Mitigation |
|---|---|---|
| Missing Content | Answer not in KB | Coverage audits |
| Missed Top Ranks | Answer ranked too low | Better embeddings, larger K |
| Not in Context | Retrieved but truncated | Smarter consolidation |
| Not Extracted | In context, LLM missed it | Reduce noise, reranking |
| Wrong Format | Ignored formatting | Explicit examples |
| Incorrect Specificity | Too general/detailed | Query analysis |
| Incomplete | Partial answer | Comprehensive prompts |
Advanced Patterns
These patterns solve specific problems. Implement only when needed.
Structured Outputs: 100% Schema Compliance
The problem with “return JSON”: Models trained to output JSON still fail ~35% of the time on complex schemas (OpenAI benchmark).
Constrained decoding solves this by filtering token probabilities at generation time. The model literally cannot produce invalid output.
How It Works Technically
- Schema converted to Context-Free Grammar (CFG)
- At each token generation step, invalid tokens are masked
- Model samples only from valid continuations
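The masking step can be shown with a toy decoder. This is a conceptual sketch, not how any provider implements it: the "grammar" is a hand-written lookup table that accepts only `{"ok":true}` or `{"ok":false}`, the logits are fixed rather than produced by a model, and greedy argmax stands in for sampling.

```python
import math

def constrained_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Mask tokens outside `allowed` to -inf, then pick the best remaining token."""
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    return max(masked, key=masked.get)

def allowed_next(prefix: str) -> set[str]:
    """Hand-written stand-in for a grammar derived from a schema."""
    table = {
        "": {"{"},
        "{": {'"ok"'},
        '{"ok"': {":"},
        '{"ok":': {"true", "false"},
        '{"ok":true': {"}"},
        '{"ok":false': {"}"},
    }
    return table.get(prefix, set())  # empty set means the output is complete

# Fake model that always prefers chatty text over valid JSON tokens
logits = {"Sure!": 5.0, "{": 1.0, '"ok"': 0.5, ":": 0.4, "true": 0.3, "false": 0.2, "}": 0.1}

output = ""
while allowed_next(output):
    output += constrained_argmax(logits, allowed_next(output))
```

Even though the unconstrained favorite is `Sure!`, the masked decoder can only ever emit a valid object, which is the property that strict structured outputs rely on.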
OpenAI Implementation
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract meeting details"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "meeting",
            "strict": True,  # Enables constrained decoding
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                    "attendees": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["title", "date", "attendees"],
                "additionalProperties": False
            }
        }
    }
)
Anthropic added structured outputs in November 2025 for Claude Sonnet 4.5 and Opus 4.1.
Critical Limitation
Structured outputs can degrade reasoning. Research paper “Let Me Speak Freely?” (August 2024) tested GPT-3.5-Turbo and found it achieved 75.99% on GSM8K in text but only 49.25% with JSON constraints—a 26.74 percentage point drop. While newer models (GPT-5.1, Claude Opus 4.5) handle constraints better, the principle remains: structured formats can constrain reasoning. Always test both approaches for reasoning-heavy tasks.
Solution: Two-Stage Approach
Reason in natural language first, then extract to structure:
# Stage 1: Natural language reasoning
reasoning = await llm.complete("Think through this problem step by step...")
# Stage 2: Structured extraction
structured = await llm.parse(
    f"Extract the final answer from: {reasoning}",
    response_format=AnswerSchema
)
Platform-Specific Continuity
OpenAI Responses API
# First request
response = client.responses.create(model="gpt-4o", input="Analyze this...")
session_id = response.id
# Continue later
response = client.responses.create(
    model="gpt-4o",
    input="Now implement...",
    previous_response_id=session_id
)
Source: OpenAI Conversation State API
Claude Code CLI
claude --continue # Resume most recent
claude --resume abc123 # Resume specific session
/compact # Summarize to save tokens
LangGraph Checkpointing
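from langgraph.checkpoint.postgres import PostgresSaver

# In recent langgraph-checkpoint-postgres releases, from_conn_string is a context manager
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # Creates the checkpoint tables on first use
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "my-workflow"}}
    result = graph.invoke(state, config)
    # Later: list saved checkpoints for this thread; invoking again with the
    # same thread_id continues from the latest one
    history = list(graph.get_state_history(config))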
Source: LangGraph Persistence Docs
Error Recovery Patterns
Retry with Exponential Backoff
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# RateLimitError comes from your provider SDK; client is a placeholder
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type((RateLimitError, TimeoutError))
)
async def call_llm(prompt: str):
    return await client.complete(prompt)
Fallback Prompts
prompts = [
    "Detailed analysis with code examples...",  # Primary
    "Brief analysis of key points...",          # Fallback 1
    "List 3 main issues...",                    # Fallback 2
]

async def analyze_with_fallbacks():
    for prompt in prompts:
        try:
            return await call_llm(prompt)
        except ValidationError:
            continue  # Try simpler prompt
    raise RuntimeError("All fallback prompts failed validation")
Circuit Breaker
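import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout  # seconds
        self.state = "closed"
        self.last_failure_time = 0.0

    async def call(self, func, *args):
        if self.state == "open":
            # After the timeout, let one trial call through
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError()
        try:
            result = await func(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call re-opens the breaker immediately
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise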
Source: Portkey Reliability Guide
Graceful Degradation
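# Each entry needs a "client"; opus_client etc. are placeholders for your SDK wrappers
providers = [
    {"name": "claude-opus", "quality": "high", "client": opus_client},
    {"name": "claude-sonnet", "quality": "medium", "client": sonnet_client},
    {"name": "gpt-4o-mini", "quality": "low", "client": mini_client},
]

async def complete_with_degradation(prompt: str):
    for provider in providers:
        # can_proceed() is assumed to return False while that provider's breaker is open
        if circuit_breakers[provider["name"]].can_proceed():
            try:
                return await provider["client"].complete(prompt)
            except Exception:
                continue  # Try the next provider
    return {"error": "All providers unavailable", "fallback": True}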
Summary: Choosing the Right Pattern
| Problem | Pattern | Cost Impact | When to Use |
|---|---|---|---|
| Critical info being ignored | Attention-aware ordering | Free | Always (default practice) |
| High API costs | Prompt caching + model routing | -50 to -90% | Repeated calls, mixed workload |
| Inconsistent JSON output | Structured outputs | Slight + | When schema compliance is critical |
| Need high-stakes accuracy | Self-consistency | 3-5x | Complex reasoning, verifiable answers |
| Large knowledge base | RAG with hybrid search | Variable | >200K tokens or dynamic content |
| Long tasks failing midway | Checkpointing | Free | Any multi-step workflow |
| Provider outages | Circuit breakers + fallbacks | Free | Production reliability critical |
Where to Start
- Do you have performance/cost problems?
  - No: Focus on free optimizations (attention ordering, checkpointing)
  - Yes: Measure bottleneck → prompt caching (if repeated calls) OR model routing (if mixed tasks)
- Do you have quality problems?
  - No: Build evaluation suite for baseline metrics
  - Yes: Deterministic tests first → self-consistency for critical paths → RAG if knowledge issue
- Do you have reliability problems?
  - Crashes: Add checkpointing
  - Provider failures: Add circuit breakers
  - Inconsistent outputs: Add structured outputs
Start with the free optimizations (attention ordering, checkpointing). Add caching for cost reduction. Layer in self-consistency and RAG for quality-critical applications only after measuring baseline performance.
Don’t optimize prematurely. Profile, measure, then optimize based on evidence.
This article synthesizes research from Stanford NLP, Google Research, OpenAI, Anthropic, LMSYS, Databricks, and Pinecone. All statistics and claims are verified against peer-reviewed sources or official documentation.