This is a companion to 8 Patterns for Spec-Driven Development. While that article covers practical workflow patterns, this one dives into the ML-informed techniques that explain why certain approaches work and how to optimize them.

⚠️ Research Note (Nov 2025): Much of the foundational research cited here (attention mechanisms, self-consistency, RAG) was conducted on 2023-2024 models (GPT-3.5-turbo, GPT-4-0125). The newest models (Claude Opus 4.5, GPT-5.1) were released in late 2025 and may exhibit different behaviors. The core principles remain valid, but specific numbers and thresholds should be validated for your workload.


Getting Started: Prioritize by Impact

Before diving into the details, here’s how to sequence these patterns for maximum impact:

Free Optimizations (Immediate Impact)

Start here—no cost increase, immediate results:

  1. Attention-aware prompt ordering (Part 1): Move critical constraints to beginning AND end of prompts
  2. Checkpointing (Part 1): Add state persistence to long-running tasks
  3. Positive instruction framing (Part 1): Replace negative prompts with specific alternatives

Cost Reduction (50-90% Savings)

Once free optimizations are stable:

  1. Prompt caching (Part 2): Cache static context (system prompts, docs, tool definitions)
  2. Model routing (Part 2): Use cheaper models for routine tasks (15-25x cost reduction on 60%+ of tasks)

Quality Improvements (Measured Gains)

After you have baseline metrics:

  1. Evaluation suite (Part 3): Build 10-100 test cases with deterministic grading where possible
  2. Self-consistency (Part 3): Apply to critical decisions only (3-5x cost, +17% accuracy on reasoning tasks)
  3. RAG pipeline (Part 4): If knowledge base exceeds 200K tokens or requires frequent updates

Don’t Implement Everything

These are options, not requirements. Profile your workload, measure impact, prioritize accordingly.

Operational Complexity vs Benefit

Before implementing, consider maintenance burden:

| Pattern | Maintenance | Benefit | When to Use |
| --- | --- | --- | --- |
| Prompt ordering | None | High | Always (free) |
| Caching (explicit) | Low | High | Repeated calls >2-3x |
| Caching (automatic) | None | Medium | OpenAI users (automatic) |
| Model routing | Medium | High | >1K requests/day with mixed workload |
| Self-consistency | Medium | Medium | Critical accuracy needs only |
| RAG pipeline | High | High | Large/dynamic knowledge base |
| Circuit breakers | Medium | Medium | Production reliability critical |

Start with high benefit, low complexity patterns. Measure impact before adding operational burden.


Part 1: Free optimizations

Start here. These patterns require no infrastructure, no cost increase, and provide immediate value.


Attention-Aware Prompt Ordering

The “Lost in the Middle” Problem

The Research: In 2023, researchers at Stanford and collaborators released “Lost in the Middle: How Language Models Use Long Contexts” as a preprint; it was published in TACL in 2024 (Liu et al.). Their findings fundamentally changed how we should structure prompts.

The Core Finding: LLM performance follows a U-shaped curve based on information position. Models perform best when relevant information appears at the beginning (primacy) or end (recency) of context, with significant degradation for middle-positioned information.

Concrete Numbers (from GPT-3.5-turbo-16k testing):

  • Nearly 20% accuracy drop when handling 30 documents vs 5
  • When key information was in the middle, performance dropped below closed-book baseline—worse than having no documents at all
  • The effect intensifies with context size

Note: This research tested GPT-3.5-turbo (2023). Newer models (GPT-4o, Claude Sonnet 4.5) handle long contexts better but still exhibit U-shaped attention patterns. The principle remains: place critical information at beginning and end.

Why Position Matters

Transformers use Attention(Q, K, V) = softmax(QK^T / √d_k) V. The softmax creates zero-sum competition—attention weights must sum to 1 across all positions. As context grows, each token competes with more tokens, diluting attention.

Additionally, positional encodings (RoPE) make distant tokens less similar, and causal masking biases models toward early tokens. Result: U-shaped attention curve.
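The dilution effect can be illustrated with a few lines of arithmetic. The sketch below is a toy, not a model of any real transformer: it assumes a single relevant token with a slightly higher logit than every distractor (the values 2.0 and 0.0 are arbitrary) and shows how its softmax weight shrinks as the context grows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_on_relevant_token(context_len, relevant_logit=2.0, other_logit=0.0):
    """Attention weight on one relevant token among context_len - 1 distractors."""
    scores = [relevant_logit] + [other_logit] * (context_len - 1)
    return softmax(scores)[0]

# The weights always sum to 1, so every added token takes a share from the rest
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} tokens -> weight on relevant token: {weight_on_relevant_token(n):.4f}")
```

Real attention heads learn sharper logits than this, so the decline is less uniform in practice, but the zero-sum constraint is the same.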

Attention Mechanism: From Formula to "Lost in the Middle"

[Interactive figure: attention distribution curve computed from the softmax in Attention(Q, K, V) = softmax(QK^T / √d_k) V. Each token shows the average attention it receives across all query positions. Panels: average attention for the start 25% (primacy bias), middle 50% (lost in the middle), and end 25% (recency bias) of the context, plus middle accuracy loss.]

How It Works

  1. Compute scores: QK^T produces attention logits based on query-key similarity
  2. Scale: Divide by √d_k = √64 = 8 to prevent extreme values
  3. Softmax: Convert to probabilities that sum to 1 (zero-sum competition)
  4. Position bias: RoPE positional encoding makes distant tokens less similar → lower attention

💡 Why the U-shape? Positional encodings (RoPE) create stronger similarity between queries and nearby keys. Middle tokens compete with many neighbors, diluting their attention. Start/end tokens have fewer competitors.

Practical Application: Attention-Aware Prompt Structure

Structure your prompts to leverage attention patterns:

| Position | Attention | Ideal type of information |
| --- | --- | --- |
| Beginning | High | Critical instructions, system prompt, architecture decisions |
| Middle | Lower | Supporting context, documentation, examples, history |
| End | High | Current task, key constraints, “Remember: use Svelte 5 runes only” |

OpenAI’s official guidance (GPT-4.1 Prompting Guide):

“If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context.”

Key technique: Repeat critical constraints at the END of your prompt, not just the beginning. Common mistake: putting all critical instructions in system prompt only. Models don’t reliably prioritize system over user prompts (Instruction Hierarchy research)—recency bias makes final instructions most influential.
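A minimal sketch of that structure follows. The function name `build_prompt` and the section labels are illustrative choices, not part of any provider's API; the point is only the ordering, with the critical rules repeated at the end.

```python
def build_prompt(critical_rules, reference_docs, task):
    """Place critical rules at both ends and bulky reference material in the middle."""
    rules = "\n".join(f"- {rule}" for rule in critical_rules)
    docs = "\n\n".join(reference_docs)
    sections = [
        f"CRITICAL RULES:\n{rules}",      # beginning: high attention
        f"REFERENCE MATERIAL:\n{docs}",   # middle: lower attention
        f"CURRENT TASK:\n{task}",         # near the end: high attention
        f"REMEMBER:\n{rules}",            # end: repeat the critical rules
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    critical_rules=["Use Svelte 5 runes only"],
    reference_docs=["<component docs>", "<style guide>"],
    task="Convert the Counter component to runes.",
)
print(prompt)
```

Repeating the rules costs a few tokens per call, which is usually negligible next to the reference material in the middle.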

Context Budget: Less Can Be More

More context ≠ better results. Empirical thresholds from Databricks research (Nov 2024): Llama-3.1-405B degrades after 32K tokens, GPT-4-0125 after 64K. Even with perfect retrieval, irrelevant padding causes a 24% accuracy drop.

Priority-based context allocation: Critical instructions (beginning + end) → Current task (near end) → Reference docs (middle) → Historical context (summarize or omit).
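One way to apply this allocation is a greedy fill in priority order. The sketch below is an assumption-laden illustration: `allocate_context` is a hypothetical helper, and the default token counter simply splits on whitespace, which only approximates a real tokenizer.

```python
def allocate_context(items, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Keep items in priority order until the token budget is spent.

    items: list of (priority, text); a lower priority number means more important.
    Returns the kept items and the number of tokens used.
    """
    kept, used = [], 0
    for priority, text in sorted(items, key=lambda item: item[0]):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            kept.append((priority, text))
            used += cost
    return kept, used

items = [
    (3, "old chat history summary"),       # historical context: first to go
    (1, "Use Svelte 5 runes only"),        # critical instruction
    (2, "Convert the Counter component"),  # current task
]
kept, used = allocate_context(items, budget_tokens=9)
print(kept, used)
```

Items that do not fit are dropped rather than truncated; summarizing them first, as the text suggests for historical context, is a natural extension.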

Alternative to context stuffing: Pinecone research found RAG preserved 95% accuracy using only 25% of tokens—75% cost reduction with minimal quality loss.


Checkpointing Long-Running Tasks

The Problem

AI tasks fail. Networks time out, rate limits hit, processes crash. Without checkpointing, you restart from scratch.

The Solution

Design as state machines with frequent checkpoints:

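import json
import os

class RefactoringWorkflow:
    def __init__(self, checkpoint_path: str, files=()):
        self.checkpoint_path = checkpoint_path
        self.state = self._load_or_init(files)

    def _load_or_init(self, files):
        # Resume from an existing checkpoint; otherwise start at the analysis phase
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)
        return {"phase": "analysis", "completed_files": [], "pending_files": list(files)}

    def checkpoint(self):
        # Write to a temp file and rename, so a crash mid-write cannot corrupt the checkpoint
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp_path, self.checkpoint_path)

    async def run(self):
        # _analyze and _transform are task-specific and left to the implementer
        if self.state["phase"] == "analysis":
            await self._analyze()
            self.state["phase"] = "transform"
            self.checkpoint()

        if self.state["phase"] == "transform":
            while self.state["pending_files"]:
                file = self.state["pending_files"][0]
                await self._transform(file)
                self.state["pending_files"].pop(0)
                self.state["completed_files"].append(file)
                self.checkpoint()  # After EACH file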

Why It Works

Checkpointing after each logical unit means failures cost seconds, not hours. Resume from the last successful state.


Positive Instruction Framing

The Problem

Negative instructions (“don’t do X”) are weaker than positive alternatives (“do Y instead”).

The Solution

Positive vs negative framing:

| Negative (Weaker) | Positive (Stronger) |
| --- | --- |
| “Don’t write too much detail” | “Provide a concise 2-3 sentence summary” |
| “Avoid technical jargon” | “Use clear, simple language” |
| “Never include opinions” | “Stick to factual statements only” |

When negative prompting IS effective: Hard safety constraints with explicit alternatives:

## FORBIDDEN (will cause build failures)

- `onMount` — use `$effect()` instead
- `$:` reactive declarations — use `$derived()` instead

If you find yourself about to write `onMount`, STOP.
This is Svelte 5. The correct pattern is `$effect()`.

Research finding (Instruction Hierarchy paper, April 2024): Models don’t reliably prioritize system prompts over user prompts, especially when social cues are present.

Practical implication: Don’t rely solely on system prompt hierarchy. Reinforce critical constraints in user messages with positive framing.


Part 2: Cost Optimization

Once free optimizations are stable, tackle cost. These patterns can reduce API expenses by 50-90%.


Prompt Caching: 50-90% Savings on Repeat Calls

Both major providers offer prompt caching. Key insight: Structure prompts with static content first, variable content last.

Provider Comparison

| Feature | Anthropic Claude | OpenAI GPT |
| --- | --- | --- |
| Activation | Explicit (mark cacheable blocks) | Automatic |
| Cache hit | 90% discount | 50% discount |
| Cache write | 1.25x (5-min TTL) or 2x (1-hr) | No premium |
| Min tokens | 1,024-4,096 | 1,024 |
| TTL | 5 min or 1 hr | 5-10 min |
| Break-even | 2-3 calls total (due to write premium) | Immediate (no write premium) |
| Invalidates | Any content change, tool changes | Any prefix change |

Structure for maximum hits (both providers):

# Static content FIRST (cacheable)
system_prompt + claude_md + tool_definitions + static_context

# Variable content LAST (not cached)
current_task + recent_messages

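When it pays off: Anthropic’s 90% discount has to offset the write premium first. With the multipliers above, caching is cheaper from the second call with the 5-minute TTL (1.25 + 0.1 < 2) and from the third call with the 1-hour TTL (2 + 0.1 + 0.1 < 3). OpenAI’s 50% discount has no write premium, so every cache hit is a saving. Both deliver significant savings when you have repeated calls with static context.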
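The break-even point can be checked with a short calculation. This is a sketch under simplifying assumptions: costs are expressed relative to one uncached call, only the cached prefix is counted, and every call is assumed to land inside the TTL window. The multipliers are the ones from the comparison table.

```python
def cached_cost(calls, write_multiplier, read_multiplier):
    """Relative input cost of `calls` requests sharing one cached prefix.

    The first call writes the cache; every later call reads it.
    """
    if calls == 0:
        return 0.0
    return write_multiplier + (calls - 1) * read_multiplier

def break_even_calls(write_multiplier, read_multiplier, max_calls=100):
    """Smallest number of calls at which caching is cheaper than not caching."""
    for calls in range(1, max_calls + 1):
        if cached_cost(calls, write_multiplier, read_multiplier) < calls:
            return calls
    return None

print(break_even_calls(1.25, 0.10))  # 1.25x write (5-minute TTL), 90% read discount
print(break_even_calls(2.00, 0.10))  # 2x write (1-hour TTL), 90% read discount
print(break_even_calls(1.00, 0.50))  # no write premium, 50% read discount
```

If calls are spaced further apart than the TTL, each one pays the write premium again and caching can cost more than it saves, so it is worth measuring the actual hit rate.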


Model Routing: Use Expensive Models Only When Needed

Not every task needs your most powerful model. RouteLLM research (LMSYS/UC Berkeley, July 2024) demonstrated:

  • 85% cost reduction with only 5% quality loss
  • Matrix factorization router achieved 95% of GPT-4 performance using only 26% GPT-4 calls

Task-to-Model Mapping

| Task Type | Model Tier | Why |
| --- | --- | --- |
| Complex reasoning/architecture | Opus 4.5 / GPT-5.1 | Requires deep analysis |
| Standard implementation | Sonnet 4.5 / GPT-4o | Good enough, faster |
| Simple transforms/formatting | Haiku 4.5 / 4o-mini | 30x+ cheaper |
| Classification/embeddings | Specialized | Purpose-built |

Current Pricing Comparison (Nov 2025)

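| Model | Input/MTok | Output/MTok | Released |
| --- | --- | --- | --- |
| Claude Opus 4.5 | $5.00 | $25.00 | Nov 2025 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Sep 2025 |
| Claude Haiku 4.5 | $1.00 | $5.00 | Oct 2025 |
| GPT-5.1 | $1.25 | $10.00 | Late 2025 |
| GPT-4o | $2.50 | $10.00 | May 2024 |
| GPT-4o-mini | $0.15 | $0.60 | Jul 2024 |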

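At these list prices, GPT-4o-mini is about 8x cheaper on input than GPT-5.1 ($0.15 vs $1.25), about 17x cheaper than GPT-4o, and about 33x cheaper than Claude Opus 4.5. Within the Claude family, Haiku 4.5 is 5x cheaper than Opus 4.5 and 3x cheaper than Sonnet 4.5.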

Implementation Patterns

  1. Classifier-based: Train small model to predict which tasks need expensive models
  2. Heuristic: Route based on query length, detected task type, keywords
  3. Cascading: Start cheap, escalate on low confidence
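The cascading pattern can be sketched as follows. How confidence is obtained is an assumption here: it could be a self-reported score, token log-probabilities, or a validator; the `Tier` type, the 0.8 threshold, and the stub generators are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tier:
    name: str
    generate: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])

def cascade(prompt: str, tiers: List[Tier], min_confidence: float = 0.8):
    """Try tiers from cheapest to most expensive; stop at the first confident answer."""
    answer, used = None, None
    for tier in tiers:
        answer, confidence = tier.generate(prompt)
        used = tier.name
        if confidence >= min_confidence:
            break
    # If no tier was confident, the last (strongest) tier's answer is returned
    return used, answer

# Stub generators stand in for real model calls
cheap = Tier("cheap", lambda p: ("draft", 0.55 if "architecture" in p else 0.95))
strong = Tier("strong", lambda p: ("careful answer", 0.90))

print(cascade("Reformat this JSON", [cheap, strong]))
print(cascade("Propose an architecture for billing", [cheap, strong]))
```

Escalated requests pay for both the cheap and the expensive call, so cascading only saves money when most traffic stops at the first tier.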

Part 3: Quality Assurance

After establishing baseline metrics, layer in quality improvements. These cost more but deliver measurable gains.


Self-Consistency: Multiple Samples, Majority Vote

Original research: “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (Wang et al., Google Research, ICLR 2023)

How It Works

  1. Sample multiple reasoning paths (n=5-40) with temperature >0
  2. Extract final answer from each path
  3. Return the most common answer (majority vote)

Performance Improvements

| Benchmark | Improvement |
| --- | --- |
| GSM8K (Grade School Math 8K) | +17.9% |
| SVAMP (Math Word Problems) | +11.0% |
| AQuA (Arithmetic Reasoning) | +12.2% |
| StrategyQA (Commonsense Reasoning) | +6.4% |

When It’s Worth the 3-5x Cost

  • Complex multi-step reasoning
  • High accuracy is critical
  • Problem has verifiable correct answer
  • Baseline accuracy is 40-70% (diminishing returns at extremes)

Implementation

from collections import Counter

def self_consistency(prompt: str, n_samples: int = 5, temperature: float = 0.7):
    # llm and extract_answer are placeholders for your client and answer parser
    answers = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=temperature)
        answers.append(extract_answer(response))

    # Majority vote over the extracted final answers
    return Counter(answers).most_common(1)[0][0]

Cost-benefit: Self-consistency costs 3-5x but improved accuracy by roughly 6-18% on the reasoning benchmarks above. Use only for high-stakes decisions where accuracy justifies cost. For routine tasks, single-pass with temperature=0 is sufficient.


LLM-as-Judge: Automated Evaluation

Using one model to evaluate another’s output. Research shows:

  • Prometheus model: 0.897 Pearson correlation with human evaluators
  • Best models: 0.723-0.803 intra-judge consistency (McDonald’s omega)

Known Biases to Account For

  1. Position bias: Prefers responses in certain positions
  2. Verbosity bias: Longer = rated higher regardless of quality
  3. Self-enhancement: Models rate their own outputs higher
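Position bias in pairwise judging has a common mitigation: ask twice with the order swapped and keep only verdicts that survive the swap. The sketch below assumes a `judge(first, second)` callable that returns "first" or "second"; the two stub judges are hypothetical and exist only to show the behavior.

```python
def debiased_preference(judge, response_a, response_b):
    """Ask a pairwise judge twice with the order swapped; accept only consistent verdicts.

    judge(first, second) must return "first" or "second".
    """
    verdict_1 = judge(response_a, response_b)
    verdict_2 = judge(response_b, response_a)
    a_wins_1 = verdict_1 == "first"
    a_wins_2 = verdict_2 == "second"
    if a_wins_1 and a_wins_2:
        return "A"
    if not a_wins_1 and not a_wins_2:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as unreliable

# Stub judges for illustration
always_first = lambda first, second: "first"  # pure position bias
prefers_longer = lambda first, second: "first" if len(first) >= len(second) else "second"

print(debiased_preference(always_first, "short", "a longer answer"))
print(debiased_preference(prefers_longer, "short", "a longer answer"))
```

The second stub shows the limit of this check: a judge with verbosity bias answers consistently in both orders, so swapping does not catch it. Length-controlled rubrics or a judge from a different model family are needed for the other two biases.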

Rubric Design Best Practices

evaluate:
  criteria:
    - name: "Correctness"
      description: "Does the answer solve the problem?"
      scale: "0-4 points"
      examples:
        4: "Completely correct, handles edge cases"
        2: "Mostly correct, minor issues"
        0: "Fundamentally wrong"
    - name: "Clarity"
      description: "Is the explanation understandable?"
      scale: "0-3 points"

Tools: Promptfoo (open source, multi-provider), Braintrust (platform), OpenAI Evals, Anthropic Console


Evaluation-Driven Development

Treat prompts like code—with tests.

Three Evaluation Types

  1. Deterministic: String matching, regex, JSON schema validation. Fast, reliable. Use when possible.
  2. Semantic: Embedding-based similarity. Good for paraphrase detection.
  3. LLM-as-Judge: Model evaluates against rubric. Handles nuance, but slower/costlier.

Anthropic’s recommendation (cookbook):

“Deterministic grading is by far the best method if you can design an eval that allows for it.”

Building Eval Suites

  1. Start with 10-100 production examples (quality over quantity early on)
  2. Include: known good results, edge cases, failure scenarios
  3. Four components per test: input, expected output, actual output, score
  4. Run on every prompt iteration
  5. Set quantitative thresholds for “breaking changes” (e.g., <5% accuracy regression)
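A deterministic eval suite along these lines can be very small. The sketch below is one possible shape, not a reference to any of the tools named above: the grader names, the case format, and `is_breaking_change` are illustrative, and `generate` stands in for whatever function calls your model.

```python
import json
import re

def grade_exact(expected, actual):
    return actual.strip() == expected.strip()

def grade_regex(pattern, actual):
    return re.search(pattern, actual) is not None

def grade_json_keys(required_keys, actual):
    try:
        data = json.loads(actual)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)

GRADERS = {"exact": grade_exact, "regex": grade_regex, "json_keys": grade_json_keys}

def run_suite(cases, generate):
    """cases: dicts with 'input', 'grader', 'expected'. Returns accuracy in [0, 1]."""
    passed = sum(
        GRADERS[case["grader"]](case["expected"], generate(case["input"]))
        for case in cases
    )
    return passed / len(cases)

def is_breaking_change(baseline_accuracy, new_accuracy, max_regression=0.05):
    """Flag a prompt change whose accuracy drops by more than the allowed regression."""
    return baseline_accuracy - new_accuracy > max_regression

cases = [
    {"input": "2+2", "grader": "exact", "expected": "4"},
    {"input": "due date", "grader": "regex", "expected": r"\d{4}-\d{2}-\d{2}"},
    {"input": "meeting", "grader": "json_keys", "expected": ["title"]},
]
# A canned lookup stands in for the model under test
canned = {"2+2": "4", "due date": "Due 2025-11-30", "meeting": "not json"}
accuracy = run_suite(cases, canned.get)
print(accuracy, is_breaking_change(0.90, accuracy))
```

Running this on every prompt iteration and failing the build when `is_breaking_change` returns true gives prompts the same safety net as unit-tested code.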

Part 4: Retrieval-Augmented Generation

RAG is powerful but complex. Only implement when the problem justifies the infrastructure.

RAG vs Full Context: Decision Matrix

Don’t build RAG just because it’s popular. Choose based on your actual constraints:

| Your Situation | Recommended Approach | Why |
| --- | --- | --- |
| KB < 100K tokens | Full context + prompt caching | Simpler, no retrieval complexity or infrastructure |
| KB 100-200K tokens, static | Full context | Most models handle this well |
| KB > 200K tokens OR dynamic updates | RAG | Beyond context limits or needs fresh data |
| Compartmentalized data (per-customer) | RAG | Retrieve only relevant partition |
| Need cross-document reasoning | Full context | RAG fragments information, hurts synthesis |
| Updates rare (monthly/quarterly) | Full context, regenerate | No infrastructure overhead |
| Need atomic transactions | Full context | RAG introduces eventual consistency |
| Team lacks ML expertise | Full context → migrate later | Operational burden too high initially |
| Latency-critical (<200ms) | Full context | RAG adds 200-500ms retrieval overhead |
| Cost-sensitive, large KB | RAG | 75% cheaper with 95% accuracy (Pinecone research) |

RAG infrastructure costs: Embedding pipeline maintenance + vector DB ($0.09-0.10/GB/month) + increased debugging complexity. Only worthwhile when KB size or freshness requirements justify it.


Chunking Strategies

How you split documents determines retrieval quality.

For prose:

| Strategy | Best For | Size (tokens) |
| --- | --- | --- |
| Fixed Token | Simple docs | 256-512 |
| Recursive Character | General | 400-512 |
| Semantic | Topic changes | Variable |

For code:

  • Function/Class Level: Natural boundaries
  • AST-Based: Respects syntax structure
  • Hybrid: Function + docstring as unit

Chroma research (2024): RecursiveCharacterTextSplitter with 400-512 tokens delivered 85-90% recall in empirical testing.
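For reference, the simplest strategy in the table (fixed token) fits in a few lines. This sketch is not the RecursiveCharacterTextSplitter mentioned above: it counts whitespace-delimited words as a stand-in for real tokens, and the overlap between neighboring chunks is an added assumption that the cited results do not cover.

```python
def chunk_tokens(text, chunk_size=400, overlap=50):
    """Split text into overlapping windows of whitespace-delimited tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_tokens(sample, chunk_size=4, overlap=1))
```

Fixed windows can cut a sentence or function in half, which is the main reason the recursive and structure-aware strategies exist.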


Hybrid Search: Semantic + Keyword

Dense retrieval (embeddings) finds semantic matches. Sparse retrieval (BM25) finds exact terms. Combine them:

score = α × dense_score + (1-α) × sparse_score

Anthropic’s Contextual Retrieval finding (2024 announcement): Combining contextual embeddings with contextual BM25 reduced retrieval failures by 49% versus embeddings alone.
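The weighted formula above can be sketched directly. One detail is an assumption on my part: dense similarities and BM25 scores live on different scales, so this version min-max normalizes each set before mixing them. The function names and the example scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}

def hybrid_rank(dense_scores, sparse_scores, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * sparse, over normalized scores."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    docs = set(dense) | set(sparse)
    combined = {
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

dense = {"a": 0.9, "b": 0.5, "c": 0.1}  # e.g. cosine similarities
sparse = {"b": 12.0, "c": 3.0}          # e.g. BM25 scores; "a" had no keyword match
print(hybrid_rank(dense, sparse, alpha=0.5))
```

Document "b" wins here because both retrievers rate it well, even though neither ranks it far ahead alone. Rank-based fusion is a common alternative when score scales are unstable.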


Re-ranking for Precision

Initial retrieval optimizes for recall (don’t miss anything). Re-ranking optimizes for precision (surface the best).

Two-stage pipeline:

  1. Fast retrieval: top-K candidates (K=20-100) using embeddings/BM25
  2. Cross-encoder re-ranking: final top-N (N=3-10) using pairwise comparison

Models: Cohere Rerank 3.5 (commercial, ~$0.002/1K docs), cross-encoder/ms-marco-MiniLM (open source)
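The two-stage shape is independent of which models fill each stage. In the sketch below the scorers are passed in as plain functions; the toy `keyword_overlap` and `phrase_match` scorers are hypothetical stand-ins for an embedding/BM25 retriever and a cross-encoder, and do not call any of the models named above.

```python
def retrieve_then_rerank(query, corpus, fast_score, precise_score, k=20, n=3):
    """Stage 1: a cheap scorer keeps the top k. Stage 2: an expensive scorer picks the top n."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:k]
    return sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)[:n]

# Toy scorers standing in for embeddings/BM25 and a cross-encoder
def keyword_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    return (query.lower() in doc.lower()) + keyword_overlap(query, doc) / 100

corpus = [
    "reset your password from the login page",
    "password policy and reset rules",
    "how to reset password in settings",
    "billing and invoices",
]
top = retrieve_then_rerank("reset password", corpus, keyword_overlap, phrase_match, k=3, n=1)
print(top)
```

The expensive scorer only ever sees k documents, which is what keeps cross-encoder latency and cost bounded as the corpus grows.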


The Seven RAG Failure Points

From “Seven Failure Points When Engineering a RAG System” (Barnett et al., January 2024):

| Failure | Description | Mitigation |
| --- | --- | --- |
| Missing Content | Answer not in KB | Coverage audits |
| Missed Top Ranks | Answer ranked too low | Better embeddings, larger K |
| Not in Context | Retrieved but truncated | Smarter consolidation |
| Not Extracted | In context, LLM missed it | Reduce noise, reranking |
| Wrong Format | Ignored formatting | Explicit examples |
| Incorrect Specificity | Too general/detailed | Query analysis |
| Incomplete | Partial answer | Comprehensive prompts |

Part 5: Advanced Patterns

These patterns solve specific problems. Implement only when needed.


Structured Outputs: 100% Schema Compliance

The problem with “return JSON”: Models trained to output JSON still fail ~35% of the time on complex schemas (OpenAI benchmark).

Constrained decoding solves this by filtering token probabilities at generation time. The model literally cannot produce invalid output.

How It Works Technically

  1. Schema converted to Context-Free Grammar (CFG)
  2. At each token generation step, invalid tokens are masked
  3. Model samples only from valid continuations
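The masking step can be illustrated with a toy vocabulary. This is a sketch of the idea only: real implementations derive the allowed set from a compiled grammar at every step, whereas here the allowed tokens for one hypothetical grammar state are written out by hand.

```python
import math

def constrained_distribution(logits, allowed_tokens):
    """Mask tokens the grammar disallows, then renormalize with softmax."""
    masked = {
        tok: (score if tok in allowed_tokens else -math.inf)
        for tok, score in logits.items()
    }
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("grammar allows none of the candidate tokens")
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical grammar state: only a string or an array may start here
logits = {'"': 1.0, "Sure": 3.0, "}": 0.5, "[": 0.2}
probs = constrained_distribution(logits, allowed_tokens={'"', "["})
print(probs)
```

Note that "Sure" had the highest raw logit. Without the mask the model would most likely have started a chatty preamble; with it, that token has probability zero and sampling can only continue valid JSON.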

OpenAI Implementation

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract meeting details"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "meeting",
            "strict": True,  # Enables constrained decoding
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                    "attendees": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["title", "date", "attendees"],
                "additionalProperties": False
            }
        }
    }
)

Anthropic added structured outputs in November 2025 for Claude Sonnet 4.5 and Opus 4.1.

Critical Limitation

Structured outputs can degrade reasoning. Research paper “Let Me Speak Freely?” (August 2024) tested GPT-3.5-Turbo and found it achieved 75.99% on GSM8K in text but only 49.25% with JSON constraints—a 26.74 percentage point drop. While newer models (GPT-5.1, Claude Opus 4.5) handle constraints better, the principle remains: structured formats can constrain reasoning. Always test both approaches for reasoning-heavy tasks.

Solution: Two-Stage Approach

Reason in natural language first, then extract to structure:

# Stage 1: Natural language reasoning
reasoning = await llm.complete("Think through this problem step by step...")

# Stage 2: Structured extraction
structured = await llm.parse(
    f"Extract the final answer from: {reasoning}",
    response_format=AnswerSchema
)

Platform-Specific Continuity

OpenAI Responses API

# First request
response = client.responses.create(model="gpt-4o", input="Analyze this...")
session_id = response.id

# Continue later
response = client.responses.create(
    model="gpt-4o",
    input="Now implement...",
    previous_response_id=session_id
)

Source: OpenAI Conversation State API

Claude Code CLI

claude --continue          # Resume most recent
claude --resume abc123     # Resume specific session
/compact                   # Summarize to save tokens

LangGraph Checkpointing

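from langgraph.checkpoint.postgres import PostgresSaver

# In recent langgraph-checkpoint-postgres releases, from_conn_string is a context manager
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # Creates the checkpoint tables; needed on first use
    graph = builder.compile(checkpointer=checkpointer)

    config = {"configurable": {"thread_id": "my-workflow"}}
    result = graph.invoke(state, config)

    # Later: inspect the checkpoints saved for this thread
    history = list(graph.get_state_history(config))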

Source: LangGraph Persistence Docs


Error Recovery Patterns

Retry with Exponential Backoff

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# RateLimitError comes from your provider SDK (for example, openai.RateLimitError)
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type((RateLimitError, TimeoutError))
)
async def call_llm(prompt: str):
    return await client.complete(prompt)

Fallback Prompts

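prompts = [
    "Detailed analysis with code examples...",  # Primary
    "Brief analysis of key points...",          # Fallback 1
    "List 3 main issues...",                    # Fallback 2
]

# ValidationError is whatever your output validator raises (for example, pydantic.ValidationError)
async def call_with_fallbacks(prompts):
    for prompt in prompts:
        try:
            return await call_llm(prompt)
        except ValidationError:
            continue  # Try simpler prompt
    raise RuntimeError("All fallback prompts failed validation")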

Circuit Breaker

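import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout
        self.state = "closed"
        self.last_failure_time = 0.0

    def can_proceed(self):
        # An open circuit accepts a trial call only after the recovery timeout has elapsed
        if self.state == "open":
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = "half-open"
            else:
                return False
        return True

    async def call(self, func, *args):
        if not self.can_proceed():
            raise CircuitOpenError()

        try:
            result = await func(*args)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call in half-open state reopens the circuit immediately
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            raise
        self.failures = 0
        self.state = "closed"
        return result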

Source: Portkey Reliability Guide

Graceful Degradation

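# Ordered from highest to lowest quality.
# opus_client, sonnet_client, and mini_client are placeholders for your provider clients.
providers = [
    {"name": "claude-opus", "quality": "high", "client": opus_client},
    {"name": "claude-sonnet", "quality": "medium", "client": sonnet_client},
    {"name": "gpt-4o-mini", "quality": "low", "client": mini_client},
]

async def complete_with_fallback(prompt: str):
    for provider in providers:
        if circuit_breakers[provider["name"]].can_proceed():
            try:
                return await provider["client"].complete(prompt)
            except Exception:
                continue  # Try the next provider

    return {"error": "All providers unavailable", "fallback": True}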

Summary: Choosing the Right Pattern

| Problem | Pattern | Cost Impact | When to Use |
| --- | --- | --- | --- |
| Critical info being ignored | Attention-aware ordering | Free | Always (default practice) |
| High API costs | Prompt caching + model routing | -50 to -90% | Repeated calls, mixed workload |
| Inconsistent JSON output | Structured outputs | Slight + | When schema compliance is critical |
| Need high-stakes accuracy | Self-consistency | 3-5x | Complex reasoning, verifiable answers |
| Large knowledge base | RAG with hybrid search | Variable | >200K tokens or dynamic content |
| Long tasks failing midway | Checkpointing | Free | Any multi-step workflow |
| Provider outages | Circuit breakers + fallbacks | Free | Production reliability critical |

Where to Start

  1. Do you have performance/cost problems?

    • No: Focus on free optimizations (attention ordering, checkpointing)
    • Yes: Measure bottleneck → prompt caching (if repeated calls) OR model routing (if mixed tasks)
  2. Do you have quality problems?

    • No: Build evaluation suite for baseline metrics
    • Yes: Deterministic tests first → self-consistency for critical paths → RAG if knowledge issue
  3. Do you have reliability problems?

    • Crashes: Add checkpointing
    • Provider failures: Add circuit breakers
    • Inconsistent outputs: Add structured outputs

Start with the free optimizations (attention ordering, checkpointing). Add caching for cost reduction. Layer in self-consistency and RAG for quality-critical applications only after measuring baseline performance.

Don’t optimize prematurely. Profile, measure, then optimize based on evidence.


This article synthesizes research from Stanford NLP, Google Research, OpenAI, Anthropic, LMSYS, Databricks, and Pinecone. All statistics and claims are verified against peer-reviewed sources or official documentation.