Insights · December 11, 2025 · 48 min read

Why Your AI Agent Gets Dumber Over Time (And How to Fix It)

Learn why AI agents degrade during long-running tasks. Master context hygiene, avoid the model upgrade trap, and build reliable agents.


Rajesh Praharaj


The model isn’t the problem. The context is.

Your AI agent crushed it on task 5. The responses were sharp, the decisions were logical, and you started imagining all the possibilities. Fast forward to task 50, and the same agent is making decisions that make absolutely no sense.

Your first thought: “I need a better model.”

Here’s what you should think instead: “What’s actually in my context window right now?”

This is one of the most common—and costly—mistakes in AI agent development. Teams spend months evaluating premium models, negotiating API contracts, and optimizing inference costs, all while ignoring the single biggest factor determining their agent’s success: the quality of context being fed to the model.

In this comprehensive guide, I’ll show you why AI agents degrade over time, the three silent killers of agent performance, and modern techniques used by leading AI teams in December 2025 to build agents that stay sharp from task 1 to task 1,000.


TL;DR — Key Takeaways

Before diving deep, here’s what you need to know:

  • AI agents fail on bad context, not bad models. Performance degradation over time is almost always a context problem.
  • The three silent killers: Signal drowning (instructions buried under noise), conflicting instructions (contradictory rules), and pattern pollution (wrong few-shot examples).
  • The Model Upgrade Trap: Upgrading models gives ~15% improvement; cleaning context gives ~40% improvement at zero cost.
  • Context window sizes in December 2025 range from 128K to 10 million tokens—but bigger isn’t automatically better.
  • Modern techniques include the 12-Factor Agent framework, file system as externalized memory, Model Context Protocol (MCP), and long-term memory systems like Letta.
  • Before upgrading your model, ask: “What percentage of my context is actually relevant to the current task?”

📑 Table of Contents
  1. The Brilliant Engineer at a Cluttered Desk
  2. Understanding Your Agent’s Memory
  3. The Three Silent Killers of AI Agent Performance
  4. The Model Upgrade Trap
  5. What Good Context Looks Like
  6. Modern Context Engineering Techniques
  7. The Context Engineering Checklist
  8. Practical Implementation Patterns
  9. Clean the Desk First

The Brilliant Engineer at a Cluttered Desk

Think of your Large Language Model (LLM) as a brilliant engineer. And I mean brilliant—world-class reasoning, extensive knowledge, incredible pattern recognition.

Now imagine you keep upgrading this engineer. First, you hire someone trained at MIT. Then you upgrade to a Stanford PhD. Then you bring in a genius from DeepMind. Each time, you’re getting someone smarter.

But here’s what you’re handing them on day one:

  • A desk with 50 browser tabs open — most of which are from last week’s project
  • Three different sets of instructions — that contradict each other
  • Printouts from 40 completed tasks — that have nothing to do with the current work
  • Notes that are 70% irrelevant — burying the one sentence they actually need

Can you see the problem?

The engineer isn’t the issue. The workspace is.

This is exactly what happens with AI agents. You upgrade from GPT-4 to Claude Sonnet 4 to GPT-5 to Opus 4.1, each time hoping for better results. But if you’re feeding these models polluted, contradictory, noise-filled context, you’re just asking smarter and smarter people to work in an increasingly chaotic environment.

This is why context engineering—the art and science of curating what information your agent sees—has become the defining discipline of AI agent development in 2025.

Let me show you what’s actually happening inside your agent’s brain.


Understanding Your Agent’s Memory

What’s Actually in Your Context Window at Turn 50?

The context window is your agent’s working memory—everything the model can “see” when making decisions. Let’s look at what a typical AI agent’s context looks like after 50 turns of work:

Token Breakdown at Turn 50:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
System prompt:              2,000 tokens
Tool results (47 turns):   85,000 tokens
Conversation history:      40,000 tokens
Retrieved documents:       25,000 tokens
Few-shot examples:          8,000 tokens
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total:                    160,000 tokens

The problem?
About 60% of those tokens are completely 
irrelevant to the current task.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You’re asking a brilliant model to find the signal in 60% noise. Is it any wonder the agent starts making strange decisions?

Context Window Sizes in December 2025

Before we go further, let’s understand what we’re working with. Context windows have exploded in size over the past year:

Model                 Provider     Context Window      Notes
──────────────────────────────────────────────────────────────────────────
Llama 4 Scout         Meta         10 million tokens   Largest in production
Llama 4 Maverick      Meta         1 million tokens    Multimodal capabilities
Gemini 2.5 Pro        Google       1 million tokens    2M expected Q3 2025
Claude Sonnet 4/4.5   Anthropic    1 million tokens    Developer beta
Claude Opus 4.1       Anthropic    200K tokens         Standard usage
GPT-5                 OpenAI       400K tokens (API)   128K in ChatGPT
GPT-4o                OpenAI       128K tokens         Multimodal
GLM-4.7               Zhipu AI     205K tokens         Released Dec 21, 2025

With Llama 4 Scout offering 10 million tokens, you might think: “Problem solved! I’ll just throw everything in there.”

This is a dangerous misconception.

The Million Token Misconception

Here’s what the data actually shows: bigger context windows don’t automatically mean better performance.

“Context quality and distraction-awareness matter more than raw context size.”

Why? Several reasons:

1. The Lost-in-the-Middle Problem

LLMs process the beginning and end of their context well, but struggle with information in the middle. This isn’t speculation—it’s been rigorously studied.

The Research: In the landmark paper “Lost in the Middle: How Language Models Use Long Contexts” (Liu et al., 2023), researchers found that when relevant information was placed in the middle of long contexts, model performance dropped by up to 20% compared to when the same information was at the beginning or end.

Concrete Example:

Scenario: Agent instructed to validate email format before user creation

┌─────────────────────────────────────────────────────────────────┐
│ CONTEXT STRUCTURE                                               │
├─────────────────────────────────────────────────────────────────┤
│ Tokens 0-2,000:      System prompt, core identity               │
│ Tokens 2,000-40,000: Tool outputs from previous tasks           │
│ Tokens 40,000-40,100: "IMPORTANT: Always validate email format  │
│                        before creating user records"            │
│ Tokens 40,100-100,000: More tool outputs, conversation history  │
│ Tokens 100,000-102,000: Current task context                    │
└─────────────────────────────────────────────────────────────────┘

Results from testing across 100 user creation tasks:

• Instruction at tokens 0-2,000 (beginning):    94% compliance
• Instruction at tokens 40,000 (middle):        43% compliance  
• Instruction at tokens 100,000+ (end):         89% compliance

The instruction didn’t change. The model didn’t change. Only the position changed—and compliance dropped by more than half.

Why This Happens: Transformer attention has a “primacy-recency” bias. The model attends most strongly to:

  • Recent tokens (what it just read)
  • Early tokens (the established context)

Information in the middle receives the least attention; that is where instructions can “drown.”
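
One practical takeaway: when you assemble the prompt, put what matters most in those high-attention regions. A minimal sketch of the idea (the message layout and helper are illustrative, not tied to any particular framework):

def build_prompt(system_prompt, critical_rules, history, current_task):
    """Place critical rules at the start AND restate them just before the
    current task, so they sit in the high-attention primacy/recency regions."""
    rules_text = "\n".join(f"- {rule}" for rule in critical_rules)
    return [
        # Beginning of context: the "primacy" position
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": "Critical rules:\n" + rules_text},

        # Middle: the bulk of history lands in the lowest-attention region
        *history,

        # End of context: the "recency" position, restate what must not be missed
        {"role": "user", "content": (
            "Reminder of critical rules:\n" + rules_text
            + "\n\nCurrent task: " + current_task
        )},
    ]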

2. Quadratic Scaling of Attention

The transformer architecture underlying LLMs means every token attends to every other token. For n tokens, that’s n² relationships. More tokens = quadratically more computation = higher costs and latency.

Context Size    Attention Computations    Relative Cost
────────────────────────────────────────────────────────
10K tokens      100 million               1x
50K tokens      2.5 billion               25x
100K tokens     10 billion                100x
500K tokens     250 billion               2,500x
1M tokens       1 trillion                10,000x

3. Noise Accumulates Faster Than Signal

In practice, most additional tokens are tool outputs, old conversation history, and retrieved documents—the majority of which aren’t relevant to the current subtask.

Industry Benchmarks: What Leading AI Teams Target

Metric                  Poor           Acceptable                Optimal
─────────────────────────────────────────────────────────────────────────────
Signal Ratio            Under 50%      50-70%                    Above 80%
Tool Output Retention   All history    Last 10 turns             Current subtask only
Instruction Conflicts   Multiple       Resolved within 5 turns   Zero tolerance
Context Utilization     Above 90%      60-80%                    40-70%

Note: Context utilization above 80% often indicates insufficient cleanup, not efficient use.

4. Cost Implications

API costs scale linearly with context length. Doubling your context doubles your costs, even if half that context is garbage.

Monthly Cost Comparison (10,000 requests/month):
────────────────────────────────────────────────
Bloated Context (160K tokens, 30% signal):
  • Input cost:  $4,800/month
  • Success rate: 41%
  • Cost per successful task: $1.17

Clean Context (45K tokens, 90% signal):
  • Input cost:  $1,350/month  
  • Success rate: 73%
  • Cost per successful task: $0.18

Savings: $3,450/month (72% reduction)
Success rate improvement: +78%

A 200,000-token context window that’s 90% signal will outperform a 1-million-token context window that’s 30% signal—every time.


The Three Silent Killers of AI Agent Performance

Through extensive analysis of production AI agents, three patterns emerge that consistently destroy agent performance over time. I call them the Silent Killers because they’re invisible unless you’re specifically looking for them.

1. Signal Drowning

Definition: Critical instructions exist in the context but are buried under a mountain of noise.

Real-World Example: The Database Migration Agent

You’re building an agent to migrate data between databases. Early on, you provide a clear instruction:

“When migrating users, check the deleted_at column and skip soft-deleted records.”

Simple enough. The agent acknowledges and begins work.

By turn 45, your context is bloated:

  • Old schema exploration logs from tables you already migrated
  • Connection test results from setup (completed 40 turns ago)
  • A posts migration that logged every single batch—hundreds of repetitive “Fetched 100 rows… Inserted 100 rows…” lines
  • Debugging output from an issue you fixed at turn 15

All of this is still sitting in context, consuming tokens and attention.

When the agent finally migrates the users table, it processes 1,000 rows. 127 of them are soft-deleted. It inserts all of them anyway.

What happened? Your instruction was there. It hadn’t changed. But it was competing for attention with mountains of stale logs. The field name deleted_at appeared dozens of times in old documentation and migration logs. The model saw it everywhere but missed what to do with it.

🔴 Anti-Pattern in Action: What the Context Actually Looked Like

Turn 45 Context (195,000 tokens):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Turn 3] Database connection test:
  ✓ Connected to source: postgres://source-db:5432/prod
  ✓ Connected to target: postgres://target-db:5432/new_prod
  ✓ Schema comparison complete
  Tables found: users, posts, comments, likes, follows...
  Column: deleted_at (timestamp, nullable) ← mentioned here
  
[Turn 5-15] Schema exploration (47,000 tokens of table descriptions)
  users table: id, email, name, created_at, deleted_at...
  deleted_at appears 23 times across various contexts
  
[Turn 8] INSTRUCTION: "When migrating users, check deleted_at 
                        and skip soft-deleted records" ← BURIED HERE
                        
[Turn 16-35] Posts migration logs:
  Batch 1: Fetched 100 rows... Inserted 100 rows... 
  Batch 2: Fetched 100 rows... Inserted 100 rows...
  [... 200 more identical lines ...]
  Completed: 20,000 posts migrated
  
[Turn 36-40] Debug session for encoding issue:
  Error: UnicodeDecodeError on row 15,234
  Attempting fix... deleted_at column unaffected ← mentioned again
  Fixed with encoding override
  
[Turn 41-44] Comments migration (similar verbose logs)

[Turn 45] Current task: "Now migrate the users table"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The instruction at Turn 8 is now competing with:
• 23 other mentions of "deleted_at" (none about skipping)
• 195,000 tokens of accumulated context
• Model attention spread thin across irrelevant data

The Fix:

After cleaning out completed work and stale logs, context drops from 195,000 tokens to 12,000. Same instruction—but now the model actually pays attention to it. The migration runs correctly.

✅ After Cleanup: What the Context Should Look Like

Turn 45 Context (12,000 tokens):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[System] You are a database migration agent.

[Active Instructions]
• Skip soft-deleted records (WHERE deleted_at IS NULL)
• Batch size: 100 records
• Log only errors, not success messages

[Completed Tasks Summary]
• Posts: 20,000 migrated ✓
• Comments: 45,000 migrated ✓

[Current Task]
Migrate users table. Remember: Skip soft-deleted records.

[Current Schema Reference]
users: id, email, name, created_at, deleted_at (skip if NOT NULL)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Insight: The instruction didn’t change. The model didn’t change. Only the noise ratio changed.


2. Conflicting Instructions

Definition: Multiple contradictory instructions accumulate in context, and the model arbitrarily chooses one.

Real-World Example: The Customer Support Refund Agent

You’re building an agent to handle refund requests. Early in the conversation, you establish a policy:

“For orders under $50, approve refunds automatically without manager review.”

Twenty turns later, you’re dealing with a fraud investigation. You add a new instruction:

“All refund requests require manager approval until further notice.”

You’re testing stricter controls. It’s temporary.

By turn 50, a customer requests a $30 refund. Both instructions are sitting in context. The agent sees:

  • “Approve refunds under $50 automatically”
  • “All refunds require manager approval”

It picks the second one—the more recent instruction. The request gets flagged for manager review.

But here’s the thing: you already resolved the fraud issue five turns ago. The temporary restriction was supposed to be lifted. You just forgot to explicitly remove it.

The model saw conflicting rules and picked one. You call it a model failure.

It’s a context failure.

🔴 Anti-Pattern in Action: Instruction Archaeology

Searching context for policy-related keywords...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Turn 5]  "Approve refunds under $50 automatically"
[Turn 12] "For VIP customers, always approve refunds regardless of amount"
[Turn 18] "Hold all refunds over $100 for manual review"
[Turn 25] "All refunds require manager approval until further notice"
[Turn 30] "Actually, keep processing small refunds automatically"
[Turn 35] "Wait, go back to requiring approval for everything"
[Turn 42] Fraud issue resolved (but no policy update!)
[Turn 50] Customer requests $30 refund

Model sees 6 different, contradictory policies.
Which one is "current"? The model has no way to know.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Common failure modes:
• Model picks most recent → wrong if situation changed
• Model picks most specific → wrong if it's outdated  
• Model picks most emphatic ("ALWAYS") → random chance
• Model hallucinates a compromise → definitely wrong

The Fix:

When you update policies, remove the old ones. Or be explicit:

“Fraud concerns resolved as of turn 42. Resume automatic approval for refunds under $50.”

✅ After Fix: Single Source of Truth Pattern

[CURRENT REFUND POLICY - Updated Turn 45]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Orders under $50: Approve automatically
• Orders $50-$100: Approve with logging
• Orders over $100: Require manager review
• VIP customers: Approve up to $200 automatically

[DEPRECATED - DO NOT USE]
• [Turn 25-42] Emergency fraud hold - RESOLVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Never leave contradictory instructions in context and expect the model to figure out which one is current. It has no way of knowing.

Key Insight: The model isn’t psychic. If you give it two conflicting rules, it will pick one. Often the wrong one.
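
One way to keep this single source of truth mechanically is a small policy registry: updating a rule replaces the old version and moves it to an explicit deprecated list. A minimal sketch (class and method names are hypothetical, not from any library):

class PolicyRegistry:
    """Exactly one active version per policy; old versions are deprecated, not forgotten."""

    def __init__(self):
        self.active = {}       # policy name -> (turn_set, text)
        self.deprecated = []   # (policy name, turn_retired, text)

    def set(self, name, text, turn):
        # Updating a policy retires the previous version explicitly
        if name in self.active:
            _, old_text = self.active[name]
            self.deprecated.append((name, turn, old_text))
        self.active[name] = (turn, text)

    def render(self):
        latest = max((turn for turn, _ in self.active.values()), default=0)
        lines = [f"[CURRENT REFUND POLICY - Updated Turn {latest}]"]
        lines += [f"• {text}" for _, text in self.active.values()]
        if self.deprecated:
            lines.append("[DEPRECATED - DO NOT USE]")
            lines += [f"• (retired turn {turn}) {text}" for _, turn, text in self.deprecated]
        return "\n".join(lines)

Rendering only this block into context, instead of the scattered turn-by-turn edicts, is what saves the model from having to guess which rule is current.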


3. Pattern Pollution

Definition: Few-shot examples teach the wrong approach for the current task.

Real-World Example: The Code Review Agent

You include few-shot examples to show your agent how to review code. The examples demonstrate handling small PRs: fetch the file, analyze it carefully, provide detailed line-by-line feedback.

Works great for focused changes.

Then someone submits an 87-file refactoring. The entire API is being restructured across dozens of modules.

Your agent does exactly what it learned—analyzes each file individually, one by one. By file 67, the context is full and early files start getting truncated. Now file 72 references a function from file 15, which is gone from context.

The agent reports phantom errors—bugs that don’t actually exist, imagined because the agent lost access to code it analyzed earlier.

🔴 Anti-Pattern in Action: Wrong Examples, Wrong Behavior

Few-Shot Examples in System Prompt:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example 1: "Review PR #234 (2 files changed)"
  → Fetch auth/login.js
  → Analyze line by line
  → Comment: "Line 45: Consider null check"
  → Fetch auth/logout.js  
  → Analyze line by line
  → Comment: "Line 12: Unused variable"
  → Summary: "2 minor issues found"

Example 2: "Review PR #256 (1 file changed)"
  → Fetch utils/helpers.js
  → Deep analysis of every function
  → 15 detailed inline comments
  → Summary: "Comprehensive review complete"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Actual Task: "Review PR #892 (87 files changed - API restructuring)"

Agent behavior (learned from examples):
  → Fetch file 1, analyze every line
  → Fetch file 2, analyze every line
  → Fetch file 3, analyze every line
  → ... (context filling rapidly)
  → File 67: Context at 95% capacity
  → File 68: Files 1-15 truncated from context
  → File 72: References function from file 8 (now gone)
  → Agent: "ERROR: validateUser() is undefined" ← phantom error
  → Agent: "ERROR: Missing import for AuthService" ← phantom error
  → Final output: 23 "bugs" that don't exist

Why did this happen? Your few-shot examples taught a pattern optimized for small changes. For a massive refactor, you need a different approach:

  1. Check the scope and intent first
  2. Understand the architectural changes
  3. Sample key files rather than reviewing all 87
  4. Provide high-level architectural feedback

Your examples never showed that pattern. The agent applied the wrong approach because that’s all it knew.

The Fix:

Implement dynamic few-shot selection. Classify the current task (small bug fix vs. large refactor vs. security review), then inject examples that match the task type.

✅ After Fix: Task-Appropriate Examples

Task Classification: "Large Refactoring" (>20 files)

Dynamically Selected Example:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example: "Review PR #789 (45 files - Database layer refactoring)"
  → First, read PR description for intent and scope
  → Identify the 5 most critical files (entry points, shared utilities)
  → Check for breaking changes in public APIs
  → Sample 3 typical implementation files for pattern consistency
  → Skip reviewing every test file individually
  → Focus on: Architecture decisions, breaking changes, missing migrations
  → Summary: "High-level architectural review with key concerns"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Now the agent knows: For large refactors, be strategic, not exhaustive.

Key Insight: The model is an excellent mimic. It follows patterns, not reasoning. If your examples are wrong for the task, the model will cheerfully apply them anyway.


The Context Rot Problem

These three killers create a phenomenon I call Context Rot—the gradual degradation of context quality over time.

Context Quality Degradation Over Time
══════════════════════════════════════════════════════════════════════

Turn 1-10:  ████████████████████████████ 95% signal
            Clean context, good decisions
            • Fresh system prompt
            • Relevant tool outputs only
            • No contradictions

Turn 11-20: ████████████████████░░░░░░░░ 75% signal
            Some noise accumulating
            • Old tool outputs lingering
            • Completed subtasks still in context
            • Performance still acceptable

Turn 21-30: ████████████░░░░░░░░░░░░░░░░ 55% signal
            Performance noticeably degrading
            • Critical instructions buried
            • Model starting to miss details
            • First "strange" decisions appear

Turn 31-40: ████████░░░░░░░░░░░░░░░░░░░░ 35% signal
            Contradictions appearing
            • Multiple conflicting instructions
            • Model picking wrong rules
            • Task success rate declining sharply

Turn 41-50: ████░░░░░░░░░░░░░░░░░░░░░░░░ 20% signal
            Agent making strange decisions
            • "Lost in the middle" effect severe
            • Wrong few-shot patterns dominating
            • Team considers upgrading model
            
══════════════════════════════════════════════════════════════════════
                     ↑                          ↑
            [CLEANUP POINT 1]          [CLEANUP POINT 2]
               (Turn 20)                  (Turn 40)

This isn’t the model getting tired. This is your context window filling up with garbage. And it happens in every long-running agent that doesn’t explicitly manage context quality.


Case Study: E-Commerce Order Processing Agent

Let’s examine a real-world scenario with concrete metrics to see how context hygiene transforms agent performance.

The Scenario

A mid-size e-commerce company built an AI agent to handle order processing, including:

  • Order validation and fraud checks
  • Inventory verification
  • Shipping calculations
  • Customer communication

The Problem (Before Context Engineering)

Week 1 Performance Metrics (No Context Management)
════════════════════════════════════════════════════════

Orders 1-100:
  • Task Success Rate:        94%
  • Average Context Size:     18,000 tokens
  • Avg Processing Time:      2.3 seconds
  • Cost per Order:           $0.05

Orders 100-500:
  • Task Success Rate:        76%  (↓ 18%)
  • Average Context Size:     89,000 tokens
  • Avg Processing Time:      5.1 seconds
  • Cost per Order:           $0.24

Orders 500-1000:
  • Task Success Rate:        52%  (↓ 42%)
  • Average Context Size:     167,000 tokens
  • Avg Processing Time:      8.7 seconds
  • Cost per Order:           $0.48
════════════════════════════════════════════════════════

Identified Issues:
1. Every order's full history kept in context
2. Fraud check results from 400 orders ago still present
3. Three different shipping policy versions (contradicting)
4. Few-shot examples showed simple orders, actual orders were complex

The Diagnosis

Context audit revealed:

Category                              Tokens     % of Total   Relevance
─────────────────────────────────────────────────────────────────────────
Active instructions                    2,100        1.3%      ✅ High
Current order data                     3,400        2.0%      ✅ High
Recent tool results (last 5 orders)    8,500        5.1%      ✅ Medium
Old tool results (orders 1-995)       98,000       58.7%      ❌ None
Deprecated policies                    4,200        2.5%      ❌ Harmful
Irrelevant conversation history       47,800       28.6%      ❌ None
Few-shot examples                      3,000        1.8%      ⚠️ Wrong type
─────────────────────────────────────────────────────────────────────────
Total                                167,000      100%        ~8% signal

The Solution: Context Engineering Implementation

Changes Made:

  1. Automatic Tool Result Expiration

    • Tool results expire after 5 orders (or 10 minutes, whichever comes first)
    • Summary of completed orders saved to file, removed from context
  2. Single Source of Truth for Policies

    • All policies consolidated into versioned [CURRENT POLICIES] block
    • Old policies explicitly marked [DEPRECATED] and moved to archive
  3. Dynamic Few-Shot Selection

    • Order complexity classifier (simple/medium/complex)
    • Example bank with 3 examples per complexity level
    • Inject only matching examples
  4. Periodic Context Resets

    • Every 25 orders: summarize and checkpoint
    • State preserved: current policies, active issues, customer preferences
    • Raw history cleared

The Results (After Context Engineering)

Week 2 Performance Metrics (With Context Management)
════════════════════════════════════════════════════════

Orders 1-100:
  • Task Success Rate:        96%  (↑ 2%)
  • Average Context Size:     15,000 tokens
  • Avg Processing Time:      1.9 seconds
  • Cost per Order:           $0.04

Orders 100-500:
  • Task Success Rate:        94%  (↑ 18%)
  • Average Context Size:     17,000 tokens
  • Avg Processing Time:      2.1 seconds
  • Cost per Order:           $0.05

Orders 500-1000:
  • Task Success Rate:        93%  (↑ 41%)
  • Average Context Size:     16,500 tokens
  • Avg Processing Time:      2.0 seconds
  • Cost per Order:           $0.04
════════════════════════════════════════════════════════

Summary of Improvements

Metric                          Before           After           Improvement
──────────────────────────────────────────────────────────────────────────────
Task Success (at 1000 orders)   52%              93%             +79%
Context Size (avg)              91,000 tokens    16,200 tokens   -82%
Cost per Order                  $0.26 avg        $0.04 avg       -85%
Processing Time                 5.4s avg         2.0s avg        -63%
Signal Ratio                    8%               87%             +988%

Same model. Same prompts. Just cleaner context.

“We almost switched to a model that costs 5x more. Turns out we just needed to clean up our context.” — Engineering Lead, after implementing context hygiene


The Model Upgrade Trap

Here’s where most teams go wrong. They see the degraded performance and think: “We need a more powerful model.”

Let’s do the math.

Option A: Upgrade Your Model

Upgrade from Claude Sonnet 4 to Claude Opus 4.1:

  • Improvement in reasoning capability: ~15%
  • Cost increase per API call: 3-5x

Option B: Clean Your Context

Remove 70,000 irrelevant tokens from your context window:

  • Improvement in task success rate: ~40%
  • Cost increase: Zero (actually decreases costs)

You’re paying for a smarter model to wade through your garbage.

Before and After

Here’s what context cleanup actually looks like in practice:

BEFORE CONTEXT CLEANUP:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total tokens:      160,000
Signal ratio:      30%
Task success rate: 41%
Model:             Claude Sonnet 4
Cost per request:  $0.48
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

AFTER CONTEXT CLEANUP:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total tokens:      45,000
Signal ratio:      90%
Task success rate: 73%
Model:             Claude Sonnet 4 (same!)
Cost per request:  $0.14
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Same model. 78% improvement in success rate. 71% cost reduction.

When your context looks clean, the same model that failed yesterday can suddenly perform far better—not because it got smarter, but because you stopped making it read the junk.

The Optimization Layer Mismatch

Think of your AI system as a stack:

┌─────────────────────────────────┐
│         Your Prompt             │ ← Most teams start here
├─────────────────────────────────┤
│       Your Context              │ ← Biggest opportunity
├─────────────────────────────────┤
│        The Model                │ ← Expensive to change
└─────────────────────────────────┘

Most teams optimize from the bottom up: upgrade the model, then tweak prompts. But the biggest gains are in the middle layer—the context.

You’re optimizing the wrong layer.


What Good Context Looks Like

So what does healthy context actually look like? Here are the five rules of Context Hygiene:

Rule 1: Only Keep Tool Results Relevant to the Current Subtask

Tool outputs pile up fast. By turn 50, your logs can be bigger than the actual task you’re working on.

Don’t do this:

  • Keep every API response forever
  • Store full database query results from 30 turns ago
  • Maintain verbose debugging output after issues are resolved

Do this instead:

  • Keep only the latest outputs the agent actually needs for the current subtask
  • Summarize or clear tool results after they’ve been used
  • Implement automatic cleanup for completed tool operations

Why it matters: When the model sees 10 similar-looking log outputs, it can easily pick the wrong one. Reduce options to reduce confusion.
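
A lightweight way to implement this is to tag every tool result with its turn and subtask, then expire anything that no longer matches. A minimal sketch (the fields and the five-turn window are assumptions; tune them for your agent):

from dataclasses import dataclass, replace

@dataclass
class ToolResult:
    turn: int
    subtask: str
    summary: str   # one-line digest, always kept
    raw: str       # full output, dropped once stale

def prune_tool_results(results, current_turn, current_subtask, max_age=5):
    """Keep raw output only for the current subtask and recent turns;
    everything else survives as its one-line summary."""
    pruned = []
    for r in results:
        if r.subtask == current_subtask and current_turn - r.turn <= max_age:
            pruned.append(r)                   # still relevant: keep in full
        else:
            pruned.append(replace(r, raw=""))  # stale: keep the digest only
    return pruned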


Rule 2: One Clear Source of Truth for Instructions

Agents break when they see two rules that contradict each other.

Don’t do this:

  • Add new policies without removing old ones
  • Use implicit versioning (“the latest instruction wins”)
  • Assume the model knows which rule is “current”

Do this instead:

  • When you update instructions, explicitly remove or mark the old ones as outdated
  • Use explicit versioning: “As of turn 42, the refund policy is…”
  • Maintain a single “current policies” section rather than scattered instructions

Why it matters: The model won’t magically know which rule is current. It will pick one. Often the wrong one.


Rule 3: Keep the Story Short

A conversation that started clean can become a swamp by turn 40.

Don’t do this:

  • Keep turn-by-turn history from 40 turns ago
  • Store every message in its original form
  • Assume the model needs your complete conversation history

Do this instead:

  • Summarize past turns into short, factual digests
  • Keep only recent turns that are still actively relevant
  • Create checkpoints: “Summary of turns 1-20: [key decisions, current state, open threads]”

Why it matters: The model doesn’t need to hold every step of how you got here. It needs a clean summary of intent, key decisions, and what’s currently open.


Rule 4: Similarity ≠ Relevance

This one catches a lot of teams using RAG (Retrieval-Augmented Generation).

Vector search gives you similar documents—not necessarily useful ones.

Don’t do this:

  • Automatically inject top-5 similar documents into context
  • Trust cosine similarity as a proxy for relevance
  • Dump everything that matches a threshold

Do this instead:

  • Before injecting retrieved documents, ask: “Does this actually help with the task right now?”
  • Implement relevance scoring beyond just similarity
  • Filter for recency and task-appropriateness

Why it matters: Loading 5 “kind of similar” documents drowns the one that actually matters.


Rule 5: Few-Shot Examples Should Match the Problem

Your examples teach the model how to act. If they’re teaching the wrong thing, you’ll get wrong behavior.

Don’t do this:

  • Use one set of static examples for all task types
  • Show examples of data transformation when the task is analysis
  • Assume more examples = better results

Do this instead:

  • Classify the current task type
  • Dynamically select examples that match the task
  • Rotate examples to prevent the model from overfitting to patterns

Why it matters: Good examples guide the model; wrong examples mislead it.


Modern Context Engineering Techniques (December 2025)

Now let’s look at the cutting-edge techniques that leading AI teams use to maintain context quality over long-running agent sessions.

The 12-Factor Agent Framework

Just as the 12-Factor App methodology revolutionized cloud-native software development, the 12-Factor Agent framework applies software engineering principles to AI agent development.

While all 12 principles matter, let’s deep-dive into the four most impactful for context management:

Principle 4: Manage Context Windows Explicitly (Most Critical)

This is the heart of preventing context rot. Don’t let context accumulate passively—actively curate what goes in and what gets removed.

🔴 Anti-Pattern: Passive Context Accumulation

# BAD: Context grows unbounded
class NaiveAgent:
    def __init__(self):
        self.messages = []
    
    def run(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        
        response = llm.complete(messages=self.messages)
        
        self.messages.append({"role": "assistant", "content": response})
        
        # Context grows forever. No cleanup. No summarization.
        # By turn 50, you're sending 200K tokens per request.
        return response

✅ Best Practice: Active Context Curation

# GOOD: Explicit context management
class ManagedAgent:
    def __init__(self, max_context_tokens=50000):
        self.max_tokens = max_context_tokens
        self.system_prompt = "..."
        self.active_instructions = {}  # Versioned, single source of truth
        self.completed_work_summary = ""
        self.recent_messages = []  # Rolling window
        self.current_tool_results = []  # Current subtask only
    
    def run(self, user_input):
        # 1. Build context explicitly
        context = self._build_context(user_input)
        
        # 2. Check and compress if needed BEFORE the call
        if self._count_tokens(context) > self.max_tokens * 0.8:
            self._compress_context()
            context = self._build_context(user_input)
        
        response = llm.complete(messages=context)
        
        # 3. Post-process: clean up, summarize completed work
        self._post_process(response)
        
        return response
    
    def _build_context(self, user_input):
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "system", "content": f"Active Instructions:\n{self._format_instructions()}"},
            {"role": "system", "content": f"Completed Work Summary:\n{self.completed_work_summary}"},
            *self.recent_messages[-10:],  # Only last 10 turns
            *self.current_tool_results,    # Current subtask only
            {"role": "user", "content": user_input}
        ]
    
    def _compress_context(self):
        # Summarize old messages
        if len(self.recent_messages) > 10:
            old_messages = self.recent_messages[:-5]
            summary = llm.summarize(old_messages)
            self.completed_work_summary += f"\n{summary}"
            self.recent_messages = self.recent_messages[-5:]
        
        # Clear stale tool results
        self.current_tool_results = []

Principle 5: Own Your Control Flow

Don’t let the LLM decide execution paths in unbounded loops. This is how context explodes.

🔴 Anti-Pattern: Delegated Loops

# BAD: LLM decides when to stop
def autonomous_agent(task):
    messages = [{"role": "user", "content": task}]
    
    while True:  # Danger! Unbounded loop
        response = llm.complete(messages=messages)
        messages.append({"role": "assistant", "content": response})
        
        if "TASK_COMPLETE" in response:
            break
        
        tool_result = execute_tool(response)
        messages.append({"role": "user", "content": tool_result})
        # Context grows with every iteration
        # Model decides when to stop (unreliable)

✅ Best Practice: Explicit Control Flow

# GOOD: You control the flow, LLM provides intelligence
def controlled_agent(task):
    # Define explicit workflow stages
    stages = [
        Stage("plan", max_iterations=1),
        Stage("execute", max_iterations=10, cleanup_after=True),
        Stage("validate", max_iterations=2),
        Stage("report", max_iterations=1)
    ]
    
    context = ContextManager(max_tokens=50000)
    
    for stage in stages:
        for i in range(stage.max_iterations):
            # Fresh, focused context for each stage
            stage_context = context.build_for_stage(stage.name)
            
            response = llm.complete(messages=stage_context)
            
            if stage.is_complete(response):
                break
            
            if stage.needs_tool(response):
                result = execute_tool(response)
                context.add_tool_result(result, stage=stage.name)
        
        if stage.cleanup_after:
            context.summarize_and_clear_stage(stage.name)
    
    return context.final_report()

Principle 9: Small, Focused Agents Beat Monoliths

An agent with narrow responsibility has cleaner context than a Swiss Army knife.

# BAD: One agent does everything
class MonolithAgent:
    """Handles research, coding, testing, deployment, monitoring, 
    customer support, analytics, and makes coffee."""
    # 50,000-token system prompt
    # 200+ tool definitions
    # Context polluted with 14 different concerns

# GOOD: Focused agents with isolated contexts
class ResearchAgent:
    """Finds and summarizes relevant information."""
    # 2,000-token system prompt
    # 5 research tools
    # Clean, focused context
    
    def research(self, topic) -> StructuredSummary:
        # Returns summary, not raw research
        pass

class CodingAgent:
    """Writes and modifies code based on specifications."""
    # 3,000-token system prompt  
    # 8 coding tools
    # Only sees code-relevant context
    
    def implement(self, spec: StructuredSummary) -> CodeChanges:
        pass

class OrchestratorAgent:
    """Coordinates focused agents, maintains high-level state."""
    # 1,500-token system prompt
    # Receives summaries, not raw data
    # Clean strategic context
    
    def complete_task(self, task):
        research = self.research_agent.research(task)
        code = self.coding_agent.implement(research.spec)
        return code

Principle 10: Explicit Error Handling

When things go wrong, give the model enough information to self-heal—but not so much that you pollute context.

# GOOD: Structured error context
def handle_tool_error(error, context_manager):
    error_context = {
        "error_type": type(error).__name__,
        "error_message": str(error)[:500],  # Truncate long errors
        "failed_action": context_manager.last_action,
        "suggested_fixes": generate_fix_suggestions(error),
        "retry_allowed": context_manager.retries_remaining > 0
    }
    
    # Add structured error, not raw traceback
    context_manager.add_error(error_context)
    
    # Clear the failed tool result to prevent confusion
    context_manager.clear_failed_result()
    
    return error_context

The key insight: Treat LLMs as libraries, not frameworks. You control the flow; the LLM provides intelligence at specific points.


File System as Externalized Memory (The Manus AI Approach)

One of the most innovative approaches in 2025 comes from Manus AI: using the file system as unlimited, persistent context.

The problem: Context windows have finite limits, but agents need infinite memory for complex tasks.

The solution: Treat the file system as externalized memory that agents can read from and write to on demand.

Implementation patterns:

Context Offloading: Instead of keeping detailed information in the context window, write it to files and keep only summaries in memory.

# Instead of:
context = [full_web_page_html, full_api_response, full_log_output]

# Do this:
write_to_file("research/page_summary.md", summarize(web_page))
write_to_file("data/api_response.json", api_response)
context = ["See research/page_summary.md for web findings", 
           "Full data in data/api_response.json"]

Recoverable Compression: Drop content but preserve the path to recover it if needed.

  • Page content can be dropped if the URL is preserved
  • Document content can be dropped if the file path remains accessible
  • The agent can always re-read the file if it needs the details
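
A minimal sketch of that pattern (the helper names are illustrative): the context keeps a one-line summary plus a path, and the agent re-reads the file only when it needs the detail again.

import os

def offload(content: str, path: str, summary: str) -> str:
    """Write the full content to disk and return the compact line that stays in context."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        f.write(content)
    return f"{summary} (full content: {path})"

def recover(path: str) -> str:
    """Re-read offloaded content on demand."""
    with open(path) as f:
        return f.read()

# In context, a 40,000-token page becomes a single line, e.g.:
# "Pricing page: 3 tiers, enterprise is custom (full content: research/pricing.html)"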

Attention Manipulation: Keep a continuously updated todo.md file that pushes global plans into the model’s recent attention span.

# Current Task Status (Updated: Turn 47)

## Active Goal
Complete user migration for ACME Corp database

## Completed
- [x] Schema analysis
- [x] Posts migration (1,245 records)
- [x] Comments migration (5,892 records)

## In Progress
- [ ] Users migration (pending soft-delete check!)

## Key Reminders
- Skip soft-deleted records (deleted_at IS NOT NULL)
- Batch size: 100 records
- Notify team when complete

By reading this file at the start of each turn, the agent maintains focus on objectives even across dozens of turns.
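
The mechanics are simple enough to sketch: re-read the file every turn and append it near the end of the prompt, where recency bias works for you rather than against you (the file name and message shape here are assumptions):

def with_plan_reminder(messages, todo_path="todo.md"):
    """Append the current plan just before the model responds, so the global
    goal always sits in the high-attention recency window."""
    try:
        with open(todo_path) as f:
            plan = f.read()
    except FileNotFoundError:
        return messages
    return messages + [{"role": "user", "content": f"Current plan and reminders:\n{plan}"}]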


Model Context Protocol (MCP)

In late 2024, Anthropic introduced the Model Context Protocol (MCP)—an open standard for how AI agents connect with external data sources, tools, and applications.

Think of it as the USB-C of AI integration.

What problem does it solve?

Previously, connecting an AI agent to external tools required custom integration code for each combination. If you had 10 AI models and 10 tools, you needed up to 100 custom connectors.

MCP provides a universal interface:

  • Standardized way for LLMs to call external tools
  • Bidirectional data exchange
  • Context sharing between multiple agents
  • Reduced “hallucination compounding” in multi-agent systems

December 2025 Status:

  • MCP was donated to the Agentic AI Foundation (AAIF) under the Linux Foundation
  • Thousands of MCP servers are now available
  • SDKs exist for Python, TypeScript, Java, Go, and other major languages

Practical MCP Integration Example

Here’s how you might set up a simple MCP server for context-aware database queries:

1. Define the MCP Server (TypeScript)

// mcp-database-server.ts
import { MCPServer, Tool, Resource } from '@modelcontextprotocol/sdk';

const server = new MCPServer({
  name: 'database-context',
  version: '1.0.0',
});

// Expose a tool that queries with context awareness
server.addTool({
  name: 'query_with_context',
  description: 'Query database with automatic context management',
  parameters: {
    query: { type: 'string', description: 'SQL query to execute' },
    max_rows: { type: 'number', default: 100 },
    summarize: { type: 'boolean', default: true }
  },
  handler: async ({ query, max_rows, summarize }) => {
    const results = await db.execute(query, { limit: max_rows });
    
    if (summarize && results.length > 20) {
      // Return summary instead of flooding context
      return {
        row_count: results.length,
        columns: Object.keys(results[0]),
        sample: results.slice(0, 5),
        summary: `${results.length} rows matching query. Sample shown above.`,
        full_results_path: await saveToFile(results)  // Recoverable
      };
    }
    
    return results;
  }
});

// Expose current schema as a resource (cached, not repeated)
server.addResource({
  uri: 'schema://current',
  name: 'Database Schema',
  handler: async () => {
    // Cached schema - agent can reference without re-fetching
    return getCachedSchema();
  }
});

server.start();

2. Connect from Your Agent (Python)

from mcp import ClientSession

async def run_with_mcp():
    async with ClientSession("localhost:3000") as session:
        # List available tools
        tools = await session.list_tools()
        
        # Query with automatic context management
        result = await session.call_tool(
            "query_with_context",
            query="SELECT * FROM users WHERE deleted_at IS NULL",
            max_rows=50,
            summarize=True  # Prevents context flooding
        )
        
        # Result is context-optimized:
        # - Summary for large results
        # - File path for recovery if needed
        # - Clean, structured format

Why MCP matters for context:

  • Reduces context duplication across multi-agent systems
  • Standardizes tool output format for consistent parsing
  • Enables caching at the protocol level
  • Supports resource URIs for reference without re-fetching

Long-Term Memory with Letta

For agents that need to remember across sessions and learn from experience, Letta (formerly MemGPT) provides a production-ready solution. It complements the in-context techniques above by persisting what the agent learns between sessions.

The core innovation: Self-editing memory. The LLM can update its own memory, allowing agents to learn and adapt over time.

Traditional Approach:
Session 1: User prefers dark mode → forgotten
Session 2: User prefers dark mode → forgotten  
Session 3: User prefers dark mode → forgotten

Letta Approach:
Session 1: User prefers dark mode → store preference
Session 2: Load preference → apply dark mode automatically
Session 3: User changes to light mode → update preference

December 2025 Milestone: Letta Code

Letta released “Letta Code” in December 2025—a memory-first coding agent that:

  • Learns from experience, user feedback, and code reviews
  • Improves progressively by carrying memories across sessions
  • Maintains persistent understanding of codebases and coding preferences

The Three Memory Types (With Concrete Examples)

1. Short-Term Memory (In-Context)

What the agent is working on right now. Lives in the context window.

Short-Term Memory Example:
──────────────────────────────────────────────────
• Currently refactoring: auth/login.ts
• Files open: auth/login.ts, auth/types.ts, utils/crypto.ts
• Last 3 actions: 
  - Read login.ts (analyzing current implementation)
  - Identified deprecated bcrypt usage
  - Planning migration to argon2
• Current goal: Replace password hashing with argon2id
──────────────────────────────────────────────────

2. Episodic Memory (What Happened)

Timestamped records of significant events. Stored externally, retrieved on demand.

Episodic Memory Examples:
──────────────────────────────────────────────────
[2025-12-15 14:32] User struggled with OAuth implementation.
  - Tried: Manual JWT handling (failed)
  - Tried: passport.js (too complex for use case)
  - Solution: Auth0 integration worked well
  - User satisfaction: High

[2025-12-18 09:15] Code review feedback received.
  - Issue: Too many inline comments
  - User preference: JSDoc for functions, minimal inline
  - Action: Updated coding style preferences

[2025-12-20 16:45] Deployment failure on auth module.
  - Cause: Missing environment variable (JWT_SECRET)
  - Resolution: Added .env.example template
  - Learning: Always check for env vars before deployment

[2025-12-22 11:00] User requested refactor of payment module.
  - Preferred pattern: Repository pattern (not direct DB calls)
  - Test coverage requirement: >80%
  - Completed successfully
──────────────────────────────────────────────────

3. Semantic Memory (Knowledge & Skills)

General knowledge learned from experience. Stored in vector database for retrieval.

Semantic Memory Examples:
──────────────────────────────────────────────────
[User Preferences]
• Coding style: Functional > OOP when possible
• Testing: Jest + React Testing Library
• Error handling: Prefer Result<T, E> pattern over exceptions
• Comments: JSDoc for exports, minimal inline
• IDE: VS Code with vim bindings

[Codebase Knowledge]
• Stack: Next.js 14, TypeScript, Prisma, PostgreSQL
• Architecture: Feature-based folder structure
• Patterns: Repository pattern for data access
• Deployment: Vercel (frontend), Railway (database)
• CI: GitHub Actions with automatic preview deploys

[Learned Skills]
• This codebase uses custom hooks for data fetching (see hooks/useApi.ts)
• Auth flow: JWT in httpOnly cookie, not localStorage
• Form validation: zod schemas in /schemas folder
• Error pages: Custom _error.tsx with Sentry integration

[Known Issues]
• N+1 query in /api/posts (needs dataloader)
• Date handling: Use date-fns, not moment (deprecated here)
• Image uploads: Max 5MB, webp preferred
──────────────────────────────────────────────────

Implementing Memory-Aware Agents

from letta import Agent, Memory

class MemoryAwareAgent:
    def __init__(self, user_id: str):
        self.memory = Memory(
            short_term_tokens=8000,     # In context
            episodic_max_entries=1000,  # Stored externally
            semantic_vector_dim=1536    # Embeddings
        )
        self.memory.load_user(user_id)  # Load preferences & history
    
    def process_request(self, request: str):
        # 1. Retrieve relevant memories
        relevant_episodes = self.memory.search_episodic(
            query=request, 
            top_k=3,
            recency_weight=0.3  # Balance relevance and recency
        )
        semantic_context = self.memory.search_semantic(
            query=request,
            top_k=5
        )
        
        # 2. Build context-optimized prompt
        context = [
            {"role": "system", "content": self._build_system_prompt()},
            {"role": "system", "content": f"User preferences:\n{semantic_context}"},
            {"role": "system", "content": f"Relevant history:\n{relevant_episodes}"},
            {"role": "user", "content": request}
        ]
        
        # 3. Execute with memory context
        response = llm.complete(messages=context)
        
        # 4. Learn from this interaction
        self.memory.add_episode(
            event=request,
            outcome=response,
            feedback=None  # Updated if user provides feedback
        )
        
        return response

RAG + Long Context: The Complementary Relationship

One of the most important realizations in 2025: RAG and long context windows aren’t competing approaches—they’re complementary.

When to use RAG:

  • Managing extensive document collections that exceed context limits
  • Dynamic, fragmented, multi-source data
  • When you need selective information inclusion to reduce noise
  • Cost-conscious applications (RAG retrieves only what’s needed)

When to use long context:

  • One-off tasks with smaller, well-defined datasets
  • Complex multi-document analysis and summarization
  • Continuous, well-structured sources
  • When the entire document is genuinely relevant
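
In practice the choice often reduces to a budget check: if the material genuinely fits with room to spare, include it; otherwise retrieve. A rough heuristic sketch (the 60% headroom figure is an assumption, not a benchmark):

def choose_strategy(corpus_tokens: int, context_window: int) -> str:
    """Rough routing heuristic between long-context stuffing and RAG."""
    # Leave headroom: instructions, history, and the response need tokens too,
    # and utilization much above 70-80% tends to hurt quality.
    usable = int(context_window * 0.6)
    return "long_context" if corpus_tokens <= usable else "rag"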

The 2025 Reality: Agentic RAG

The trend is toward Agentic RAG—agents that autonomously search, validate, and process vast amounts of data to provide precise answers or execute actions.

Traditional RAG:
Query → Retrieve top-K docs → Generate response

Agentic RAG:
Query → Assess what's needed → 
        Search multiple sources →
        Validate and cross-reference →
        Re-search if gaps exist →
        Generate grounded response

This combines the efficiency of RAG with the reasoning power of long-context models.
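
A minimal sketch of that loop, with search, assess_gaps, and format_evidence standing in for your own retriever, gap-checking prompt, and formatter (they are placeholders, not library calls):

def agentic_rag(query, max_rounds=3):
    """Retrieve, check coverage, and re-search before generating a grounded answer."""
    evidence = []
    open_questions = [query]

    for _ in range(max_rounds):
        for question in open_questions:
            evidence.extend(search(question, top_k=3))   # your retriever

        # Ask the model what the evidence still fails to answer
        open_questions = assess_gaps(query, evidence)     # returns [] when covered
        if not open_questions:
            break

    return llm.complete(messages=[
        {"role": "system", "content": "Answer strictly from the evidence provided."},
        {"role": "user", "content": f"Question: {query}\n\nEvidence:\n{format_evidence(evidence)}"},
    ])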


The Context Engineering Checklist

Before you reach for that model upgrade, run through this checklist.

The Pre-Upgrade Audit: 5 Questions to Ask

1. What percentage of my context window is actually relevant to the current task?

Open your context. Count the tokens. How many are signal? How many are noise?

Target: Above 80% signal ratio. If you’re below 50%, you have a context problem, not a model problem.
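
A quick way to answer this is to count tokens per context section and compute the ratio directly. A minimal sketch using the tiktoken tokenizer (which sections count as “signal” is your call; the category names are examples):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def signal_ratio(sections: dict, signal_keys: set) -> float:
    """sections: e.g. {"system": ..., "tool_results": ..., "history": ...}
    signal_keys: the sections actually relevant to the current task."""
    counts = {name: len(enc.encode(text)) for name, text in sections.items()}
    total = sum(counts.values())
    signal = sum(counts[name] for name in signal_keys if name in counts)
    return signal / total if total else 1.0

# Aim for a ratio above 0.8; below 0.5, fix the context before touching the model.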

2. Are there conflicting instructions anywhere in my context?

Search for phrases like “always,” “never,” “make sure to,” “important:”. Do any of them contradict each other? Are there outdated policies that were never removed?

3. Am I keeping tool results I’ll never reference again?

That database query from turn 8—does the agent still need it at turn 50? Probably not. Implement automatic cleanup for completed tool outputs.

4. Do my few-shot examples match the current problem type?

If you’re doing code analysis, are you showing examples of data transformation? If you’re processing a 100-file refactor, are your examples all single-file bug fixes?

5. Have I summarized or cleared old conversation turns?

Something said 30 turns ago—is it still relevant? If an entire sub-task is complete, does the agent still need the full history of that task?

Quick Wins: Immediate Improvements

If you’ve identified issues, here are fixes you can implement today:

  1. Set maximum token budgets per context section. Tool outputs: 20K max. Conversation history: 15K max. Retrieved docs: 10K max.

  2. Implement automatic tool output summarization after 5 turns. If a tool result hasn’t been referenced in 5 turns, summarize it or drop it.

  3. Add instruction versioning with clear “deprecated” markers. [DEPRECATED as of turn 35] makes it obvious what’s current.

  4. Create dynamic few-shot example pools. Categorize examples by task type and inject only matching examples.

  5. Add context quality metrics to your observability stack. Track signal ratio, instruction conflicts, and memory usage over time.


Practical Implementation Patterns

Let’s get concrete. Here are battle-tested patterns for managing context in production AI agents.

Pattern 1: Context Windowing and Summarization

Don’t keep full history forever. Use a rolling window with periodic summarization:

Turn 1-20:   Full history
Turn 21:     Summarize turns 1-20 → "Digest A"
Turn 21-40:  Digest A + Full history
Turn 41:     Summarize turns 21-40 → "Digest B"  
Turn 41-60:  Digest A + Digest B + Full history
...

Implementation:

def manage_context(messages, max_tokens=50000, summary_threshold=30):
    """Manage context with automatic summarization."""
    
    if count_tokens(messages) > max_tokens:
        # Find a natural break point
        cutoff = len(messages) // 2
        
        # Summarize older messages
        old_messages = messages[:cutoff]
        summary = llm.summarize(old_messages, prompt="""
            Create a factual summary of this conversation including:
            - Key decisions made
            - Current state of all tasks
            - Important instructions that are still active
            - Open threads or pending items
        """)
        
        # Replace with summary + recent messages
        return [{"role": "system", "content": f"Previous context: {summary}"}] \
               + messages[cutoff:]
    
    return messages

Pattern 2: Relevance Filtering Before Injection

Don’t blindly inject everything that matches your RAG query:

def filter_for_relevance(retrieved_docs, current_task, threshold=0.7):
    """Filter retrieved documents by task relevance, not just similarity."""
    
    scored_docs = []
    
    for doc in retrieved_docs:
        # Score relevance specifically to current task
        relevance = llm.score_relevance(
            document=doc,
            task=current_task,
            prompt="On a scale of 0-1, how directly relevant is this document "
                   "to completing the current task? Consider: Does it contain "
                   "information the agent needs RIGHT NOW, not just related info?"
        )
        
        if relevance > threshold:
            scored_docs.append((relevance, doc))
    
    # Limit to the top 3 most relevant documents
    scored_docs.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored_docs[:3]]

Pattern 3: Dynamic Few-Shot Selection

Match examples to the task:

import random

EXAMPLE_BANK = {
    "small_fix": [
        {"input": "Fix typo in README", "output": "..."},
        {"input": "Update version number", "output": "..."},
    ],
    "refactor": [
        {"input": "Restructure auth module", "output": "..."},
        {"input": "Split monolith service", "output": "..."},
    ],
    "security_review": [
        {"input": "Audit payment endpoint", "output": "..."},
        {"input": "Check input validation", "output": "..."},
    ]
}

def select_examples(task_description):
    """Select few-shot examples matching the task type."""
    
    # Classify the task
    task_type = llm.classify(
        task_description,
        categories=["small_fix", "refactor", "security_review", "other"]
    )
    
    # Get matching examples
    examples = EXAMPLE_BANK.get(task_type, [])
    
    # Rotate to prevent overfitting
    if len(examples) > 2:
        return random.sample(examples, 2)
    return examples

Pattern 4: Periodic Context Resets with State Preservation

Every N turns, create a clean checkpoint:

def checkpoint_and_reset(agent_state, turn_number, reset_interval=25):
    """Reset context periodically while preserving critical state."""
    
    if turn_number % reset_interval != 0:
        return agent_state.context
    
    # Create checkpoint
    checkpoint = {
        "summary": llm.summarize(agent_state.context),
        "current_goal": agent_state.goal,
        "completed_tasks": agent_state.completed,
        "active_instructions": extract_current_instructions(agent_state),
        "key_entities": extract_entities(agent_state),
        "open_threads": agent_state.pending_items,
        "turn_number": turn_number
    }
    
    # Save checkpoint to file for recovery
    save_checkpoint(f"checkpoints/turn_{turn_number}.json", checkpoint)
    
    # Build fresh context from checkpoint
    fresh_context = build_context_from_checkpoint(checkpoint)
    
    return fresh_context

Pattern 5: Sub-Agent Architecture

For complex tasks, use specialized sub-agents with isolated contexts:

Main Agent (Strategic)
    ├── Research Sub-Agent
    │     ├── Own context: research docs, web results
    │     └── Returns: structured summary, not raw data
    ├── Code Review Sub-Agent
    │     ├── Own context: code files, review criteria
    │     └── Returns: findings list, not file contents
    └── Communication Sub-Agent
          ├── Own context: user preferences, message history
          └── Returns: drafted message, not conversation log

Benefits:

  • Each sub-agent has a clean, focused context
  • Raw data stays in sub-agent context
  • Main agent receives only summaries
  • Reduces context pollution at the strategic level

Implementation:

class SubAgent:
    def __init__(self, name, system_prompt, max_context=30000):
        self.name = name
        self.system_prompt = system_prompt
        self.max_context = max_context
        self.context = []
    
    def execute(self, task):
        # Sub-agent works in its own isolated context
        result = llm.complete(
            system=self.system_prompt,
            messages=self.context + [{"role": "user", "content": task}],
        )
        
        # Return structured summary, not raw context
        return self.summarize_for_parent(result)
    
    def summarize_for_parent(self, result):
        """Convert detailed result to structured summary for parent agent."""
        return llm.complete(
            system="Extract only the key findings and decisions. "
                   "Be concise. Format as bullet points.",
            messages=[{"role": "user", "content": result}]
        )
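To make the pattern concrete, here is a hedged sketch of how a main agent might delegate to these sub-agents and receive only summaries back. The SubAgent class and llm helper are the hypothetical ones from the snippet above; main_agent.plan is likewise an assumed method, not a real API.

research = SubAgent(
    name="research",
    system_prompt="You research technical topics and report only key findings."
)
reviewer = SubAgent(
    name="code_review",
    system_prompt="You review code against the given criteria and list findings."
)

def run_complex_task(main_agent, task):
    # Each sub-agent works in its own isolated context
    research_summary = research.execute(f"Gather background for: {task}")
    review_summary = reviewer.execute(f"Review the code affected by: {task}")

    # The main agent only ever sees the structured summaries,
    # so its strategic context stays clean
    return main_agent.plan(task=task, inputs=[research_summary, review_summary])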

Clean the Desk First

Let’s bring it all together.

The Key Insights

  1. Agent failures are context failures, not model failures. When your agent degrades at turn 50, the model didn’t get dumber—your context got messier.

  2. Signal drowning, conflicting instructions, and pattern pollution are the three silent killers. They’re invisible unless you’re looking for them, and they accumulate over time.

  3. The Model Upgrade Trap is real. ~15% improvement from upgrading models vs. ~40% improvement from cleaning context—at zero cost.

  4. Context quality degrades over time. Implement context hygiene proactively, not reactively.

  5. Modern techniques exist. The 12-Factor Agent framework, file system memory, MCP, and long-term memory systems like Letta are production-ready solutions.

The Mental Model Shift

From: “I need a better model.”

To: “What’s in my context window?”

The Call to Action

The next time your agent fails and you think “I need a better model,” stop.

Open your context window. Count the signal. Count the noise.

You’re probably asking a brilliant engineer to work at a desk buried in printouts from last month’s project.

Clean the desk first.

The model you have is likely good enough.

Your context isn’t.



FAQs

General Context Management

How often should I clean my agent’s context?

For long-running agents, implement automatic cleanup every 15-25 turns. For simpler use cases, monitor context size and clean when you reach 70% of your token budget.

Rule of thumb:

  • Task-based agents: Clean after each completed subtask
  • Conversational agents: Summarize every 15-20 messages
  • Long-running autonomous agents: Checkpoint every 25 turns
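If you want these rules of thumb in code, a minimal trigger sketch might look like this. It assumes you already have a count_tokens helper; summarize_and_trim is a placeholder for your own Pattern 1-style cleanup routine, and the thresholds are the suggested defaults, not hard limits.

def should_clean_context(turn_number, messages, max_tokens,
                         turn_interval=20, budget_ratio=0.7):
    """Trigger cleanup on a fixed turn interval OR when the token budget fills up."""
    on_interval = turn_number > 0 and turn_number % turn_interval == 0
    over_budget = count_tokens(messages) > budget_ratio * max_tokens
    return on_interval or over_budget

# In the agent loop (sketch):
# if should_clean_context(turn, messages, MAX_CONTEXT_SIZE):
#     messages = summarize_and_trim(messages)  # your Pattern 1-style routine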

What’s the ideal context size for optimal performance?

Research suggests 40-70% utilization of your context window produces the best results. More important than size is the signal-to-noise ratio—aim for 80%+ relevant content.

Context Size Sweet Spots:
──────────────────────────────────────────────────
128K context window → Target: 50-90K tokens active
200K context window → Target: 80-140K tokens active  
1M context window   → Target: 400-700K tokens active

Can’t I just use the biggest context window available?

Bigger isn’t automatically better. A lean 50K context with 90% signal outperforms a bloated 500K context with 30% signal. Costs also scale linearly with context size.

The math:

  • 500K context at 30% signal = 150K useful tokens + 350K noise
  • 50K context at 90% signal = 45K useful tokens + only 5K noise
  • The 50K context will typically perform better and cost roughly 90% less

Measuring & Monitoring

How do I measure context quality?

Track these five key metrics:

  1. Signal Ratio = (relevant tokens / total tokens) × 100
  2. Instruction Conflicts = count of contradictory policies in context
  3. Stale Tool Outputs = % of tool results older than 10 turns
  4. Task Success Rate Over Time = success rate at turn N vs. turn 1
  5. Cost per Successful Task = total API cost / successful completions

Implementation:

def calculate_context_health(context):
    return {
        "signal_ratio": calculate_signal_ratio(context),
        "instruction_conflicts": find_conflicts(context),
        "stale_tool_outputs": count_stale_results(context, threshold_turns=10),
        "total_tokens": count_tokens(context),
        "utilization": count_tokens(context) / MAX_CONTEXT_SIZE
    }
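The helpers referenced here (calculate_signal_ratio, count_stale_results, and so on) are not standard library functions. Here is one hedged way to sketch two of them, assuming each context entry is a message dict and tool results carry the turn number they were produced on:

def calculate_signal_ratio(context):
    """Fraction of tokens marked relevant to the current task (0.0-1.0).

    Assumes messages carry a boolean "relevant" flag set by an LLM judge,
    a heuristic, or hand-labelling while debugging; unlabelled messages
    count as relevant.
    """
    total = sum(count_tokens([m]) for m in context) or 1
    relevant = sum(count_tokens([m]) for m in context if m.get("relevant", True))
    return relevant / total

def count_stale_results(context, threshold_turns=10):
    """Share of tool outputs more than `threshold_turns` turns old (0.0-1.0)."""
    tool_msgs = [m for m in context if m.get("role") == "tool"]
    if not tool_msgs:
        return 0.0
    current_turn = max(m.get("turn", 0) for m in context)
    stale = [m for m in tool_msgs
             if current_turn - m.get("turn", current_turn) > threshold_turns]
    return len(stale) / len(tool_msgs)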

What tools can I use to visualize context over time?

While specific tool recommendations change quickly, look for:

  • LLM observability platforms (Langfuse, Helicone, Weights & Biases)
  • Custom dashboards tracking the metrics above
  • Token counting libraries for real-time monitoring

Minimum viable monitoring: Log context size and success rate for every request. Plot over time.
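A hedged sketch of that minimum viable monitoring, assuming a count_tokens helper: it appends one JSON line per request so any plotting tool can read the log later.

import json
import time

def log_request(context, success, path="context_log.jsonl"):
    """Append one record per request: timestamp, context size, and outcome."""
    record = {
        "ts": time.time(),
        "context_tokens": count_tokens(context),
        "success": bool(success),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")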

Technical Implementation

Should I use RAG or long context?

Both—they’re complementary. Use this decision framework:

Situation                          | Use RAG | Use Long Context
-----------------------------------|---------|-----------------
Large document corpus (100+ docs)  |    ✓    |
Single document analysis           |         |        ✓
Cost-sensitive application         |    ✓    |
Complex cross-document reasoning   |         |        ✓
Frequently changing data           |    ✓    |
One-time deep analysis             |         |        ✓
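If you want to encode that decision framework, a deliberately crude router might look like the sketch below. The argument names and thresholds are illustrative, and the priority order (freshness before reasoning depth before cost) is a judgment call rather than a rule from this article.

def choose_retrieval_strategy(num_docs, data_changes_often,
                              needs_cross_doc_reasoning, cost_sensitive):
    """Crude router based on the table above; returns "rag" or "long_context"."""
    if data_changes_often or num_docs >= 100:
        return "rag"            # large or fast-moving corpus: retrieve on demand
    if needs_cross_doc_reasoning:
        return "long_context"   # keep the related documents in one window
    if cost_sensitive:
        return "rag"            # pay only for the chunks you actually need
    return "long_context"       # small, stable corpus: load it whole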

How do I handle multi-agent context sharing?

Key principles for multi-agent systems:

  1. Pass summaries, not raw data between agents
  2. Use MCP for standardized context sharing
  3. Isolate contexts — each agent has its own clean context
  4. Orchestrator receives structured outputs, not full agent histories

# Good: Structured handoff
research_result = research_agent.run(task)
coding_agent.run(research_result.summary)  # Just the summary

# Bad: Raw context dump
research_result = research_agent.run(task)
coding_agent.context.extend(research_agent.context)  # Context pollution!

What’s the performance cost of context summarization?

Summarization adds latency but typically pays for itself. The numbers below are illustrative:

Without summarization (at turn 50):
  • Context size: 180K tokens
  • LLM call latency: 8.2 seconds
  • Cost: $0.54
  • Success rate: 45%

With summarization (at turn 50):
  • Summarization overhead: 1.2 seconds (every 20 turns)
  • Context size: 35K tokens  
  • LLM call latency: 2.1 seconds
  • Cost: $0.12
  • Success rate: 89%

Net benefit: Faster, cheaper, more reliable

How do I test context quality in CI/CD?

Add context quality checks to your test suite:

def test_context_quality_after_50_turns():
    agent = create_agent()
    
    # Simulate 50 turns of work
    for i in range(50):
        agent.process(sample_tasks[i])
    
    health = agent.get_context_health()
    
    assert health["signal_ratio"] > 0.8, "Signal ratio too low"
    assert health["instruction_conflicts"] == 0, "Found conflicting instructions"
    assert health["stale_tool_outputs"] < 0.1, "Too many stale tool outputs"
    assert health["utilization"] < 0.7, "Context over 70% capacity"

When should I use context windowing vs. full reset?

Use this decision tree:

Is continuity required between segments?
     │
     ├── YES → Use Context Windowing
     │         (Summarize old, keep recent)
     │
     └── NO  → Is the subtask complete?
               │
               ├── YES → Use Full Reset with Checkpoint
               │         (Save state, start fresh)
               │
               └── NO  → Continue with current context
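Encoded as a function, that tree is just two branches. This is a hedged sketch; the return values are arbitrary labels for your own dispatch logic, not part of any framework.

def choose_context_strategy(continuity_required: bool, subtask_complete: bool) -> str:
    """Mirror of the decision tree above."""
    if continuity_required:
        return "windowing"    # summarize old messages, keep recent ones
    if subtask_complete:
        return "full_reset"   # checkpoint state, then start fresh
    return "continue"         # keep working in the current context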

Troubleshooting

My agent suddenly started making mistakes. What happened?

Check these in order:

  1. Context size — Did you hit capacity? Check token count.
  2. Recent policy changes — Did you add instructions without removing old ones?
  3. Tool output accumulation — Are old results drowning new ones?
  4. Few-shot mismatch — Did the task type change without updating examples?

My agent ignores specific instructions. Why?

Most likely causes:

  • Lost in the middle: Instruction buried in middle of long context
  • Conflicting instructions: Another instruction contradicts it
  • Too many “always” statements: Model can’t follow all of them

Quick fix: Move the critical instruction to the beginning or end of your system prompt, and explicitly remove any conflicting statements.
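One way to apply that quick fix automatically is to re-pin critical rules as the last system message before every call, so they can never drift into the middle. This is a hedged sketch: the CRITICAL_INSTRUCTIONS list and the dedup logic are assumptions about how you store your rules, not part of any framework.

CRITICAL_INSTRUCTIONS = [
    "Never modify files outside the project directory.",
    "Always run the test suite before committing.",
]

def pin_critical_instructions(messages):
    """Re-append critical rules as the final system message before each LLM call."""
    reminder = "Active critical rules:\n" + "\n".join(
        f"- {rule}" for rule in CRITICAL_INSTRUCTIONS
    )
    # Drop any earlier copy of the reminder so it appears exactly once, at the end
    cleaned = [m for m in messages if m.get("content") != reminder]
    return cleaned + [{"role": "system", "content": reminder}]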

My agent’s responses are getting slower over time. Why?

This is almost always context bloat:

  • More tokens = more attention computation = slower inference
  • Rule of thumb: 2x tokens means at least 2x latency, since attention cost grows superlinearly with context length

Solution: Implement aggressive context cleanup. Response times should stay consistent across turns if context size stays consistent.


Troubleshooting Guide: Diagnosing Context Problems

Use this guide when your agent starts behaving unexpectedly.

Symptom → Cause → Fix Reference

Symptom                                   | Likely Cause                                | Section to Review
------------------------------------------|---------------------------------------------|------------------------------------------
Agent ignores specific instructions       | Signal Drowning or Conflicting Instructions | Signal Drowning, Conflicting Instructions
Agent applies wrong approach to task      | Pattern Pollution                           | Pattern Pollution
Performance degrades over time            | Context Rot                                 | Context Rot Problem
Agent makes up facts                      | Lost in the Middle + Noise                  | Million Token Misconception
Success rate dropping at high turn counts | All three silent killers                    | Case Study
Costs increasing over time                | Context bloat                               | Cost Implications
Agent gives inconsistent answers          | Conflicting instructions                    | Rule 2: Single Source of Truth

The 5-Minute Context Audit

Run this checklist when debugging agent issues:

☐ 1. CHECK CONTEXT SIZE
     Current tokens: _____ / Max: _____
     Utilization: _____% 
     → If >80%, implement cleanup

☐ 2. SEARCH FOR CONFLICTS
     Count of "always": _____
     Count of "never": _____
     Count of "must": _____
     → Review each for contradictions

☐ 3. AUDIT TOOL OUTPUTS
     Tool results in context: _____
     Results older than 10 turns: _____
     → If >5 stale results, clean up

☐ 4. CHECK FEW-SHOT EXAMPLES
     Current task type: _____
     Example task types: _____
     → If mismatch, update examples

☐ 5. VERIFY INSTRUCTION VISIBILITY
     Critical instruction at token position: _____
     Is it in first 10% or last 10% of context?
     → If not, move it
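If you prefer to run the audit from code, here is a rough equivalent built on the calculate_context_health sketch from the FAQ above. The thresholds mirror the checklist, and item 5 (instruction position) is left as a manual check.

def five_minute_audit(context, current_task_type, example_task_types):
    """Print a PASS/FAIL line for checklist items 1-4 above."""
    health = calculate_context_health(context)
    checks = {
        "1. Context size under 80% of capacity": health["utilization"] < 0.8,
        "2. No conflicting instructions": health["instruction_conflicts"] == 0,
        "3. Stale tool outputs under 10%": health["stale_tool_outputs"] < 0.1,
        "4. Few-shot examples match task type": current_task_type in example_task_types,
    }
    for name, passed in checks.items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
    return all(checks.values())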

Emergency Context Reset

When all else fails, here’s how to recover:

def emergency_reset(agent):
    """Nuclear option: Full context reset with state preservation."""
    
    # 1. Extract what we absolutely need
    critical_state = {
        "current_goal": agent.extract_current_goal(),
        "active_policies": agent.get_current_policies(),  # Latest only
        "pending_items": agent.get_open_threads(),
        "key_entities": agent.extract_key_entities(),
    }
    
    # 2. Save for recovery if needed (path is illustrative, matching Pattern 4)
    save_checkpoint("checkpoints/emergency_reset.json", critical_state)
    
    # 3. Build fresh context
    fresh_context = [
        {"role": "system", "content": agent.system_prompt},
        {"role": "system", "content": f"Current goal: {critical_state['current_goal']}"},
        {"role": "system", "content": f"Active policies:\n{critical_state['active_policies']}"},
        {"role": "system", "content": f"Pending items:\n{critical_state['pending_items']}"},
    ]
    
    # 4. Replace agent context
    agent.context = fresh_context
    
    return agent


Last updated: December 2025


Tags

#AI agents #context engineering #LLM optimization #context window #AI development #agentic AI
