Introduction
In Chapter 3, we covered tokens and context windows — the working memory limit of an AI system and why every token has a cost.
In Chapter 7, we covered sampling — why the same prompt can produce different outputs and how randomness affects reliability.
In Chapter 8, we covered hallucination — why fluent language is not proof, and how grounding, retrieval, and verification reduce product risk.
Now we go deeper into a topic that sounds like good news in every vendor slide deck:
Longer context windows.
Model providers keep announcing larger context limits — hundreds of thousands or millions of tokens. Product teams hear that and think: “Great, we can finally send everything.”
That instinct is understandable. It is also dangerous.
A long context window increases capacity. It does not automatically increase focus, accuracy, or trust. Many production failures happen inside the window — when context is noisy, stale, duplicated, or poorly prioritized.
This chapter explains the tradeoffs product managers must own: context rot, attention budget, context engineering, compaction, structured notes, just-in-time retrieval, sub-agents, and when long context beats RAG — or does not.
The simple PM version
Bigger context is a bigger desk — not a cleaner desk.
Capacity → Curation → Retrieval → Focus.
Design what enters context, not only how much fits.
1. The PM Promise: What This Chapter Helps You Decide
After reading this chapter, you should be able to explain to engineering, design, and leadership:
- why a 1M-token window does not mean “upload the entire knowledge base,”
- why agent workflows degrade even when they technically fit in context,
- when to use long context, RAG, tools, compaction, or sub-agents,
- how context choices affect cost, latency, hallucination, and debuggability,
- what to specify in PRDs beyond “use the biggest model.”
The goal is not to memorize context limits. The goal is to make better product decisions when context is the bottleneck — which it often is.
PM takeaway
Long context is a design constraint and a product lever. Treat it like capacity planning, not a magic feature flag.
2. Capacity vs Focus: Bigger Window, Same Attention Problem
A context window is the maximum amount of text (measured in tokens) the model can process in one request — input plus output. A long context window means more room for documents, chat history, tool results, and instructions at once.
That capacity is genuinely useful when:
- the model must compare several long documents in one pass,
- a codebase slice must be reasoned about together,
- an agent needs recent tool results plus task state without constant re-fetching,
- a user expects continuity across a long working session.
But capacity is not the same as focus. The model still has to decide what matters inside a large pile of text. More text often means more competition for attention — not automatically better answers.
PM analogy
Giving an analyst a bigger desk does not help if you also dump every email, draft, and duplicate PDF on that desk before they start. They may have space to work — and still miss the one clause that decides the case.
| Question | Weak PM answer | Better PM answer |
|---|---|---|
| Can we fit it? | “Yes, the window is huge.” | “Yes, but should we send it all at once?” |
| Will quality improve? | “More context = smarter.” | “Only if context is relevant and structured.” |
| What is the cost? | “Engineering will optimize later.” | “We budget tokens per task and per user tier.” |
PM takeaway
Specify focus requirements, not only capacity requirements. “Fits in context” is not a success metric.
3. Context Rot: When More Context Makes Answers Worse
Context rot is the degradation in model behavior as the context window fills with low-value, stale, or conflicting information — even before you hit the hard token limit.
Anthropic and other teams use this term to describe a practical product reality: performance can fall as clutter grows.
| Symptom | What users see |
|---|---|
| Missed instruction | Model ignores a rule from early in the thread |
| Lost fact | Model forgets a detail from an uploaded document |
| Wrong emphasis | Model answers from an old or irrelevant section |
| Shallow reasoning | Model summarizes instead of comparing |
| Tool confusion | Agent calls the wrong tool or repeats calls |
| Workflow drift | Task goal shifts mid-session |
Example: claims review
You pass in a policy PDF, a hospital bill, a discharge summary, an email trail, an OCR dump, prior notes, and processor comments. The token count may be under the limit. The model may still fail to prioritize the waiting-period clause that drives the decision.
Context rot connects directly to Chapter 8: when the model loses the right evidence, it fills gaps with plausible language.
PM takeaway
Measure quality vs context size — not only overflow errors. Rot is a silent failure mode.
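One way to make rot visible is to bucket eval results by input size and compare accuracy across buckets. The sketch below is a minimal illustration, not a full eval harness: the function name, the bucket size, and the sample records are all assumptions made up for this example.

```python
from collections import defaultdict

def accuracy_by_context_size(results, bucket_size=50_000):
    """Group eval results into input-size buckets and report accuracy per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [correct, total]
    for r in results:
        bucket = (r["input_tokens"] // bucket_size) * bucket_size
        totals[bucket][0] += int(r["correct"])
        totals[bucket][1] += 1
    return {b: correct / n for b, (correct, n) in sorted(totals.items())}

# Made-up eval records, for illustration only.
eval_results = [
    {"input_tokens": 12_000, "correct": True},
    {"input_tokens": 30_000, "correct": True},
    {"input_tokens": 140_000, "correct": True},
    {"input_tokens": 145_000, "correct": False},
]

report = accuracy_by_context_size(eval_results)
for bucket, acc in report.items():
    print(f"{bucket:>7,}+ tokens: {acc:.0%} correct")
```

If accuracy drops in the larger buckets while requests still fit in the window, that is a signal to look for rot rather than for overflow.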
4. Attention Budget: Every Token Competes for Focus
Transformers use attention to relate tokens to each other. You do not need the math to manage the product — you need the metaphor: attention budget.
Every instruction, example, document chunk, tool definition, tool result, and chat turn consumes budget. High-value tokens (task goal, authoritative source, current user question) compete with noise (old tool JSON, duplicate summaries, verbose logs).
| Context component | Budget impact |
|---|---|
| System prompt + policies | Fixed overhead every call |
| Tool schemas (many tools) | Large hidden tax on agents |
| Raw API responses | Often high volume, low signal |
| Few-shot examples | Helpful but expensive if bloated |
| Retrieved chunks | Good if ranked well; harmful if noisy |
| Output reservation | Must leave room for the answer |
Product requirement: define a token budget per workflow stage — what must stay, what can be summarized, what must be cleared after use.
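A token budget can be written down as a small, checkable object. The sketch below assumes a hypothetical `ContextBudget` class with illustrative numbers; the component names mirror the table above, and none of the caps are recommendations.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Hypothetical per-request budget; all numbers are illustrative."""
    window_tokens: int      # total context window of the model
    output_reserve: int     # tokens held back for the answer
    caps: dict[str, int]    # per-component input caps

    def input_limit(self) -> int:
        return self.window_tokens - self.output_reserve

    def check(self, usage: dict[str, int]) -> list[str]:
        """Return human-readable violations for a planned request."""
        problems = []
        for component, used in usage.items():
            cap = self.caps.get(component)
            if cap is None:
                problems.append(f"{component}: no budget defined")
            elif used > cap:
                problems.append(f"{component}: {used} tokens exceeds cap {cap}")
        total = sum(usage.values())
        if total > self.input_limit():
            problems.append(f"total input {total} exceeds limit {self.input_limit()}")
        return problems

budget = ContextBudget(
    window_tokens=200_000,
    output_reserve=8_000,
    caps={"system": 2_000, "tools": 4_000, "retrieval": 20_000, "history": 10_000},
)
violations = budget.check(
    {"system": 1_500, "tools": 6_500, "retrieval": 12_000, "history": 9_000}
)
print(violations)
```

In this example the request fits the window comfortably, yet the tool schemas alone exceed their cap. That is the kind of hidden tax a per-component budget is meant to surface.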
PM takeaway
Attention budget is shared. Adding context is not free — it steals focus from something else.
5. Prompt Engineering vs Context Engineering
Prompt engineering asks: What instruction should the model follow?
Context engineering asks: What information should the model see — and in what shape — before it answers?
For enterprise and agentic products, the second question usually matters more. A perfect prompt cannot compensate for missing policy clauses, wrong retrieval, or ten thousand tokens of stale tool output.
| Layer | PM owns |
|---|---|
| Prompt | Task, tone, output format, refusal rules |
| Context selection | Which sources enter this request |
| Context ordering | What appears first (authority, recency) |
| Context lifecycle | When to summarize, compact, or clear |
| Context verification | Citations, tools, human review gates |
Further reading
Anthropic Engineering: Effective context engineering for AI agents
Anthropic Docs: Context windows
PM takeaway
Write PRDs that specify context rules, not only prompt text. “Be accurate” is not a context strategy.
6. High-Signal Context: Less Volume, More Decision Value
High-signal context is information that directly changes the answer or the next action. Low-signal context is technically related but not needed for this step.
| Low signal | High signal |
|---|---|
| Full email thread | Latest customer question + claim ID |
| Entire policy PDF | Relevant clauses + exclusions for this diagnosis |
| Raw CRM export | Member status, plan tier, effective dates |
| Complete repo tree | Files touched by the bug + test failures |
Design pipelines that extract, rank, and format before the model call — not dump-then-hope.
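The rank-then-format step can be sketched in a few lines. This is a simplified illustration that assumes each chunk already carries a relevance score and a token count from an upstream retriever; the function names, the threshold, and the sample chunks are hypothetical.

```python
def select_context(chunks, max_tokens, min_score=0.5):
    """Keep the highest-scoring chunks that fit within max_tokens."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if chunk["score"] < min_score:
            break  # every remaining chunk is lower-signal
        if used + chunk["tokens"] > max_tokens:
            continue  # does not fit; a smaller chunk further down may
        selected.append(chunk)
        used += chunk["tokens"]
    return selected

def format_context(selected):
    """Label each chunk with its ID so answers can cite sources."""
    return "\n\n".join(f"[{c['id']}] {c['text']}" for c in selected)

# Illustrative chunks; text, scores, and token counts are placeholders.
chunks = [
    {"id": "clause-4.3", "text": "(waiting period clause)", "score": 0.92, "tokens": 300},
    {"id": "exclusions", "text": "(exclusions list)", "score": 0.81, "tokens": 500},
    {"id": "clause-9.1", "text": "(related clause)", "score": 0.60, "tokens": 400},
    {"id": "email-thread", "text": "(full email thread)", "score": 0.30, "tokens": 4000},
]

selected = select_context(chunks, max_tokens=1000)
print(format_context(selected))
```

Here the email thread is dropped for low relevance and the third clause is dropped for budget, which is the behavior "minimum sufficient context" implies.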
PM takeaway
Define “minimum sufficient context” per task type. That becomes your default retrieval and UI behavior.
7. System Prompts: Stable Rules in a Crowded Window
The system prompt (or developer instructions) sets durable behavior: role, safety, output schema, escalation, citation rules. It is loaded on every call and competes for attention budget.
Common PM mistakes:
- system prompt grows into a 20-page policy manual,
- duplicate rules appear in system prompt and user message,
- conflicting instructions between system, tools, and retrieved docs,
- no versioning when compliance updates.
| Belongs in system prompt | Belongs elsewhere |
|---|---|
| Role, tone, refusal boundaries | Full policy PDF → retrieval |
| Output JSON schema | Live claim status → tool/API |
| Citation required for factual claims | Case-specific facts → context block |
PM takeaway
Keep system prompts lean and versioned. Push volatile knowledge to retrieval or tools.
8. Few-Shot Examples: Useful, Expensive, Easy to Bloat
Few-shot examples teach format and behavior by demonstration. They work — and they consume tokens fast. Three long examples can cost more than a retrieved policy summary.
| When few-shot helps | When to avoid |
|---|---|
| Strict JSON extraction | Task is simple classification |
| Consistent adjudication wording | Examples contradict updated policy |
| Brand tone for short replies | You need live data from APIs |
Prefer one canonical example plus schema validation over ten similar examples. Refresh examples when rules change — stale few-shot is a silent compliance bug.
9. Tool Context: Definitions, Calls, and Raw Results
Agents add tool definitions, call arguments, raw responses, and follow-up reasoning to context. Each tool call can multiply token use and rot risk.
| Design choice | Product effect |
|---|---|
| Fewer tools per task | Less schema noise, better tool choice |
| Narrow tool responses | Less raw JSON in context |
| Summarize after tool use | Keeps facts, drops payload |
| Clear old tool results | Reduces rot in long sessions |
Specify in requirements: which tools exist per workflow, max calls per turn, what fields return, and when results leave context.
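Those requirements can be captured as a small policy table that the agent runtime enforces. The sketch below is an assumption-heavy illustration: the tool name `get_member_eligibility`, its fields, and the call limit are invented for the example and do not refer to any real API.

```python
# Hypothetical policy: tool name, fields, and limits are illustrative.
TOOL_POLICY = {
    "get_member_eligibility": {
        "fields": ["status", "plan_tier", "effective_date"],
        "max_calls_per_turn": 2,
    },
}

def narrow_tool_result(tool_name: str, raw: dict) -> dict:
    """Keep only the fields this workflow is allowed to place in context."""
    allowed = TOOL_POLICY[tool_name]["fields"]
    return {field: raw[field] for field in allowed if field in raw}

def call_allowed(tool_name: str, calls_this_turn: int) -> bool:
    """Enforce the per-turn call limit before another call is made."""
    return calls_this_turn < TOOL_POLICY[tool_name]["max_calls_per_turn"]

raw = {
    "status": "active",
    "plan_tier": "gold",
    "effective_date": "2025-01-01",
    "audit_log": ["entry"] * 500,    # high volume, low signal
    "debug": {"trace_id": "abc"},
}
slim = narrow_tool_result("get_member_eligibility", raw)
print(slim)
```

The raw payload can still be logged for audit; only the narrowed result needs to enter the model's context.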
PM takeaway
Tool design is context design. Bloated tools are a tax on every agent step.
10. Just-in-Time Retrieval: Fetch When Needed, Not Up Front
Just-in-time (JIT) retrieval loads documents or data only when the task needs them — often via search, RAG, or tool calls during the conversation — instead of pre-loading everything into context.
Benefits:
- lower cost per turn,
- fresher data (query at answer time),
- less rot from irrelevant material,
- clearer audit trail (“retrieved clause 4.3 at 10:42”).
Tradeoff: retrieval quality must be good. JIT does not remove the need for ranking, versioning, and fallback when nothing relevant is found.
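The fallback requirement is easy to overlook, so here is a minimal sketch of it. The retriever below is a toy keyword-overlap matcher standing in for a real search or vector index; the index contents, version labels, and function names are assumptions for illustration.

```python
def retrieve(query_terms, index, min_overlap=2, top_k=2):
    """Toy keyword-overlap retriever; a real system would use a search index."""
    query = set(query_terms)
    scored = []
    for doc_id, doc in index.items():
        overlap = len(query & set(doc["terms"]))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return [
        {"id": doc_id, "version": index[doc_id]["version"], "overlap": overlap}
        for overlap, doc_id in scored[:top_k]
    ]

def build_jit_context(query_terms, index):
    hits = retrieve(query_terms, index)
    if not hits:
        # Fallback: report the gap instead of answering without evidence.
        return {"status": "no_relevant_source", "sources": []}
    return {"status": "ok", "sources": hits}

index = {
    "clause-4.3": {"terms": ["maternity", "waiting", "period"], "version": "2026-01"},
    "clause-7.1": {"terms": ["dental", "exclusion"], "version": "2026-01"},
}

found = build_jit_context(["maternity", "waiting", "period"], index)
missing = build_jit_context(["vision", "cover"], index)
print(found)
print(missing)
```

Returning the document version with each hit supports the audit trail, and the explicit `no_relevant_source` status gives the product a place to ask a clarifying question or escalate.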
PM takeaway
Default to JIT for large corpora. Reserve long context for tasks that truly need cross-document reasoning in one pass.
11. Progressive Disclosure: Reveal Detail in Stages
Progressive disclosure in AI products means showing summary first, then letting the user or agent drill into sections, attachments, or tool-backed detail.
Example: policy assistant shows “Maternity waiting period may apply — see Clause 4.3” with expand-to-source, instead of injecting the full 200-page booklet on every question.
| UI pattern | Context effect |
|---|---|
| Summary card + “View source” | Small default context |
| User picks documents | Explicit focus |
| Agent requests next file | JIT expansion |
| Section-level chunking | Targeted retrieval |
PM takeaway
UX and context engineering are the same decision. Disclosure patterns control token load.
12. Compaction: Shrink History Without Losing the Thread
Compaction replaces long chat history or tool traces with a shorter summary while preserving goals, decisions, open questions, and constraints.
Compaction is not “delete everything.” Bad compaction drops audit-critical facts.
| Keep in compact state | Safe to drop |
|---|---|
| User goal, claim ID, constraints | Verbose tool stderr |
| Decisions made, pending approvals | Duplicate retrieved chunks |
| Source IDs for cited facts | Intermediate draft paragraphs |
Trigger compaction at predictable points: N turns, M tokens, after tool-heavy subtasks, or before handoff to another agent.
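These triggers can be expressed as one predictable check. The sketch below assumes a hypothetical `SessionStats` record and placeholder thresholds; the actual values for N and M are product decisions, not defaults to copy.

```python
from dataclasses import dataclass

@dataclass
class SessionStats:
    turns_since_compaction: int
    history_tokens: int
    tool_calls_in_subtask: int
    handoff_pending: bool

def compaction_reasons(stats, max_turns=20, max_tokens=30_000, max_tool_calls=5):
    """Return every trigger that currently applies; an empty list means no compaction."""
    reasons = []
    if stats.turns_since_compaction >= max_turns:
        reasons.append("turn_limit")
    if stats.history_tokens >= max_tokens:
        reasons.append("token_limit")
    if stats.tool_calls_in_subtask >= max_tool_calls:
        reasons.append("tool_heavy_subtask")
    if stats.handoff_pending:
        reasons.append("handoff")
    return reasons

stats = SessionStats(
    turns_since_compaction=8,
    history_tokens=42_000,
    tool_calls_in_subtask=2,
    handoff_pending=True,
)
reasons = compaction_reasons(stats)
print(reasons)
```

Logging the reasons alongside each compaction event makes later debugging easier, because the team can see why history was shrunk at that point.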
13. Structured Note-Taking: Memory Outside the Window
Instead of keeping everything in the model’s working memory, agents can write structured notes — JSON, tables, or task logs stored outside context and re-injected when needed.
| Field | Example |
|---|---|
| task_id | CLM-2026-8841 |
| finding | Discharge summary missing |
| evidence | Checklist §3.2; upload scan empty |
| next_action | Raise shortfall |
| confidence | high |
Structured notes support human review, replay, and compliance — and they re-enter context as small high-signal blocks, not raw logs.
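The fields in the table above map directly onto a typed record. The sketch below is one possible shape, assuming a hypothetical `CaseNote` class and an assumed set of allowed confidence values; the sample values come from the table.

```python
import json
from dataclasses import dataclass, asdict

ALLOWED_CONFIDENCE = {"low", "medium", "high"}  # assumed vocabulary

@dataclass(frozen=True)
class CaseNote:
    task_id: str
    finding: str
    evidence: str
    next_action: str
    confidence: str

    def __post_init__(self):
        # Reject free-text confidence so downstream review can filter reliably.
        if self.confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(f"confidence must be one of {sorted(ALLOWED_CONFIDENCE)}")

    def to_context_block(self) -> str:
        """Serialize the note as a compact block for re-injection into context."""
        return json.dumps(asdict(self), ensure_ascii=False)

note = CaseNote(
    task_id="CLM-2026-8841",
    finding="Discharge summary missing",
    evidence="Checklist §3.2; upload scan empty",
    next_action="Raise shortfall",
    confidence="high",
)
block = note.to_context_block()
print(block)
```

A fixed schema like this is what lets notes be stored, replayed, and reviewed by a human without re-reading the raw session.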
PM takeaway
Define the note schema in the PRD. Unstructured agent scratchpads do not scale to production.
14. Sub-Agents: Fresh Context for Heavy Subtasks
A sub-agent runs a focused subtask in a separate context — research one topic, extract one document, run one tool chain — then returns a compact result to the main agent.
Why PMs should care:
- isolates rot (research noise does not pollute main thread),
- allows different models or budgets per subtask,
- improves debuggability (subtask inputs/outputs are bounded).
Example: main claims agent keeps decision state; sub-agent reads only the policy PDF and returns clause list + citations.
15. Multi-Agent Context: Handoffs, Duplication, and Drift
Multiple agents multiply context risk: duplicated instructions, conflicting summaries, and handoffs that drop evidence.
| Risk | Control |
|---|---|
| Duplicate system prompts | Shared policy pack, role-specific deltas only |
| Lost citations on handoff | Pass source IDs, not paraphrases only |
| Contradictory summaries | Single structured state object |
| Runaway token use | Per-agent budgets and max turns |
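The controls in this table can be checked mechanically at each handoff. The sketch below assumes a single structured state object with four required fields; the field names and the validation rules are illustrative, not a standard.

```python
# Assumed contract: fields every handoff must carry.
REQUIRED_HANDOFF_FIELDS = ("goal", "decisions", "open_questions", "source_ids")

def validate_handoff(state: dict) -> list[str]:
    """Return contract violations; an empty list means the handoff is acceptable."""
    errors = [f"missing field: {f}" for f in REQUIRED_HANDOFF_FIELDS if f not in state]
    if "source_ids" in state and not state["source_ids"]:
        errors.append("source_ids is empty: cited facts cannot be traced")
    return errors

good = {
    "goal": "Decide next action on claim",
    "decisions": ["Eligibility confirmed"],
    "open_questions": ["Is the discharge summary on file?"],
    "source_ids": ["clause-4.3"],
}
bad = {"goal": "Decide next action on claim", "decisions": [], "source_ids": []}

print(validate_handoff(good))
print(validate_handoff(bad))
```

Rejecting a handoff that lacks source IDs is a cheap way to stop paraphrased, untraceable facts from spreading between agents.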
PM takeaway
Multi-agent architecture needs a context contract between agents — what must survive every handoff.
16. Long-Horizon Workflows: Sessions That Outlive One Window
Real products run for hours or days: coding agents, claims investigations, research projects. No single context window holds the entire journey.
Product patterns:
- checkpoint structured state to database,
- compact conversation at milestones,
- restart context with summary + open tasks,
- expose “session memory” the user can edit or delete.
Pair these patterns with the sampling guidance in Chapter 7: run compaction and extraction subtasks at low temperature so that factual steps stay stable.
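The checkpoint-and-restart pattern can be sketched as follows. A plain dictionary stands in for the database here, and the state fields (`summary`, `open_tasks`) and function names are assumptions chosen for illustration.

```python
import json

_store = {}  # stand-in for a real database table

def checkpoint(session_id: str, state: dict) -> None:
    """Persist structured session state outside the context window."""
    _store[session_id] = json.dumps(state)

def restart_context(session_id: str) -> str:
    """Rebuild a small starting context from the last checkpoint."""
    state = json.loads(_store[session_id])
    open_tasks = "\n".join(f"- {task}" for task in state["open_tasks"])
    return f"Summary so far:\n{state['summary']}\n\nOpen tasks:\n{open_tasks}"

checkpoint(
    "session-42",
    {
        "summary": "Eligibility confirmed; discharge summary still missing.",
        "open_tasks": ["Request discharge summary", "Re-check waiting period clause"],
    },
)
resumed = restart_context("session-42")
print(resumed)
```

Because the state lives outside the model, the same record can back a user-facing "session memory" view that supports editing and deletion.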
17. Long Context vs RAG: When to Use Which
Long context and RAG are complements, not substitutes.
| Approach | Strength | Weakness |
|---|---|---|
| Long context | Cross-doc reasoning in one pass | Cost, latency, rot risk |
| RAG | Finds relevant slices from huge corpus | Bad retrieval → bad answers |
| Tools / APIs | Current system-of-record data | Workflow complexity |
| Both | Broad awareness + precise grounding | Engineering and eval overhead |
| Use case | Lean toward |
|---|---|
| Compare 5 clauses in one contract | Long context (bounded set) |
| Answer from 50k internal policies | RAG + citation |
| Live claim status | Tool/API, not static context |
| Repo-wide refactor plan | RAG + graph + sub-agents |
PM takeaway
Do not choose long context because RAG is hard. Fix retrieval — or use long context for bounded, high-value bundles only.
18. Cost and Latency: Context Is a Unit Economics Problem
Providers often price by tokens in and out. Long context requests cost more and often run slower — especially with large input plus reasoning or tools.
| Driver | PM action |
|---|---|
| Input size | Cap uploads; default summarization |
| Output size | Structured short fields vs essays |
| Model tier | Route simple steps to smaller models |
| Agent loops | Max turns; stop conditions |
| Caching | Reuse stable system + doc prefixes where supported |
See Chapter 3 for token fundamentals. This chapter adds the tradeoff lens: bigger context is a product pricing decision, not only a technical one.
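A quick calculation shows how input size drives cost per request. The prices below are hypothetical placeholders, not any provider's actual rates; the 400k-token figure echoes the weak design in the claims example in section 21, and the 12k-token curated request is an assumed comparison point.

```python
def request_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one request in dollars, given per-million-token prices."""
    return (
        input_tokens * price_in_per_million + output_tokens * price_out_per_million
    ) / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
PRICE_IN, PRICE_OUT = 3.0, 15.0

dump_everything = request_cost(400_000, 1_000, PRICE_IN, PRICE_OUT)
curated = request_cost(12_000, 1_000, PRICE_IN, PRICE_OUT)

print(f"dump everything: ${dump_everything:.3f} per request")
print(f"curated context: ${curated:.3f} per request")
print(f"ratio: {dump_everything / curated:.1f}x")
```

Under these assumed prices the uncurated request costs roughly 24 times more per call, before counting agent loops that repeat the same input on every turn.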
19. Debuggability: Can You Explain What the Model Saw?
When an answer is wrong, teams ask: bad model, bad prompt, or bad context? Without visibility, PMs and engineers guess.
Minimum observability for context-heavy products:
- token count per component (system, retrieval, tools, history),
- retrieved chunk IDs and scores,
- compaction events (before/after summary),
- tool call trace with redacted payloads,
- version of policy / knowledge index.
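The observability list above can be captured as one trace record per request. The sketch below shows a possible shape; the function name, field names, and sample values are assumptions, and a real system would write the record to its logging or tracing backend.

```python
import json

def build_context_trace(request_id, component_tokens, retrieved, index_version, compaction_applied):
    """Assemble a replayable record of what entered the context for one request."""
    return {
        "request_id": request_id,
        "tokens_by_component": dict(component_tokens),
        "total_input_tokens": sum(component_tokens.values()),
        "retrieved": [{"chunk_id": cid, "score": score} for cid, score in retrieved],
        "index_version": index_version,
        "compaction_applied": compaction_applied,
    }

trace = build_context_trace(
    request_id="req-001",
    component_tokens={"system": 1_200, "retrieval": 5_400, "tools": 2_100, "history": 3_300},
    retrieved=[("clause-4.3", 0.92), ("clause-9.1", 0.60)],
    index_version="policy-index-v12",
    compaction_applied=False,
)
print(json.dumps(trace, indent=2))
```

With a record like this attached to every answer, support and eval teams can tell a bad retrieval apart from a bad prompt without guessing.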
PM takeaway
Require “context replay” in eval and support tooling. If you cannot replay context, you cannot fix rot.
20. Product Patterns That Manage Context Well
| Pattern | Behavior |
|---|---|
| Select before send | User or system picks relevant docs |
| Summarize then reason | Map-reduce for very large inputs |
| Retrieve then answer | RAG with citation UI |
| Compact at milestones | Scheduled context shrink |
| Sub-agent isolation | Heavy read in child context |
| Graceful overflow | Offer split, batch, or priority — not raw error |
| Source-first UX | Show what was used; allow challenge |
Pick patterns per risk tier: brainstorming can be loose; claims and compliance cannot.
21. Practical Example: Claims Adjudication Assistant
Weak design: Upload all documents every turn; 400k tokens of PDFs and emails in one call.
Better design:
- structured case header (claim type, dates, diagnosis codes),
- JIT retrieval of policy clauses by procedure,
- tool fetch for member eligibility and prior auth status,
- structured note: missing docs, cited checklist rows,
- low-temperature extraction sub-agent; main agent decides next action,
- human review gate before customer-facing denial.
PM takeaway
Context design reduces hallucination risk from Chapter 8 — the model sees the right evidence, not the largest pile.
22. Practical Example: Codebase Assistant
Developers want “whole repo” understanding. Dumping the entire repository into context fails on cost, rot, and stale files.
Better design:
- retrieve symbols and files related to stack trace or ticket,
- sub-agent explores test failures in isolated context,
- progressive disclosure: summary of architecture, expand file on click,
- compact after each fix attempt — keep test output summary, not full logs,
- cite file paths and line ranges in answers.
Long context helps for a bounded slice (e.g. one service’s core files), not the monorepo zip.
23. Practical Example: Research and Legal Review
Research workflows tempt teams to “load all papers.” A better design uses JIT retrieval by query, structured notes per source (claim, method, limitation, citation), and a sub-agent per paper for extraction, with the main agent synthesizing the results and stating its uncertainty explicitly.
Legal review: use long context for one agreement’s interconnected clauses and RAG across the precedent library, and never mix an unlabeled draft with the executed contract in one undifferentiated block.
PM takeaway
Synthesis quality depends on note quality, not on maximum PDF count in one prompt.
24. What PMs Should Not Do
| Mistake | Why it fails |
|---|---|
| “We have 1M context — send everything” | Rot, cost, and missed focus |
| Skip retrieval because window grew | Corpus still too large; stale chunks |
| Never compact agent sessions | Drift and tool noise accumulate |
| Giant system prompt as knowledge base | Stale rules; attention waste |
| No token or context telemetry | Cannot debug wrong answers |
| Same context strategy for all tiers | Enterprise needs stricter curation |
| Assume long context fixes hallucination | More text can increase unsupported claims |
| Ignore output reservation | Truncated or empty answers |
Context strategy is a product discipline — not a model marketing checkbox.
25. Decision Framework: Five Questions Before Every Feature
- What decision does this step support? — Only include context that changes that decision.
- What is authoritative? — Policy PDF vs email rumor; system of record vs chat memory.
- What must be fresh? — JIT tool or retrieval vs static upload.
- What fits the budget? — Token cap, compaction triggers, sub-agent split.
- How will we verify? — Citations, structured output, human review — per Chapter 8.
| If… | Then lean toward… |
|---|---|
| Corpus is huge, question is narrow | RAG + JIT |
| Few long docs must be compared | Bounded long context |
| Session runs many tool steps | Compaction + structured notes |
| Task is exploratory / creative | Smaller context OK; higher sampling temperature OK |
| High-risk factual output | Minimal context + verify + human gate |
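The table above is simple enough to encode as a lookup, which can be handy in a design review checklist or an internal tool. The sketch below is illustrative only: the trait names are invented labels for the table rows, and the advice strings paraphrase the right-hand column.

```python
# Trait names are hypothetical labels for the rows of the decision table.
RULES = [
    ("huge_corpus_narrow_question", "RAG + JIT retrieval"),
    ("few_long_docs_compared", "Bounded long context"),
    ("many_tool_steps", "Compaction + structured notes"),
    ("exploratory_task", "Smaller context, higher temperature acceptable"),
    ("high_risk_factual", "Minimal context + verification + human gate"),
]

def recommend(traits: set[str]) -> list[str]:
    """Return the advice for every trait that applies, in table order."""
    unknown = traits - {name for name, _ in RULES}
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    return [advice for name, advice in RULES if name in traits]

advice = recommend({"many_tool_steps", "high_risk_factual"})
print(advice)
```

A feature often matches more than one row, so the function returns every applicable recommendation rather than picking a single winner.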
26. The PM Mental Model
Long context gives capacity. Reliable products need curation, retrieval, and focus.
Capacity → Curation → Retrieval → Focus.
Context engineering is how you spend attention budget wisely.
| Layer | Purpose |
|---|---|
| Capacity | Know the window and reserve output space |
| Curation | High-signal, ordered, authoritative context |
| Retrieval | JIT and RAG for large or fresh knowledge |
| Compaction | Shrink history without losing decisions |
| Structured notes | State outside the window |
| Sub-agents | Isolate heavy reads |
| Observability | Replay what the model saw |
| Verification | Citations and review for factual claims |
Do not buy a bigger window and call it a strategy. Design the path from capacity to focus.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Long context window | More capacity, not automatic quality |
| Context rot | Performance degrades as clutter grows |
| Attention budget | Every token competes for focus |
| Context engineering | What the model sees, not only what it is told |
| JIT retrieval | Fetch when needed; stay fresh |
| Compaction | Shrink history; keep decisions and sources |
| Structured notes | Auditable state outside the window |
| Sub-agents | Isolate heavy subtasks |
| Long context vs RAG | Use each for the right problem |
| Cost and latency | Context size drives unit economics |
| PM role | Define context rules, budgets, and fallbacks in the product |
Closing Thought
Vendor announcements will keep pushing larger context numbers. Your job is to translate that into product behavior users can trust.
Users do not experience “tokens.” They experience whether the assistant remembered the right clause, ignored the wrong email, cited the current policy, and stayed on task after the twentieth tool call.
The teams that win treat context like product infrastructure: curated, retrieved, compacted, observable, and matched to risk — with hallucination controls from Chapter 8 still in place.
The next chapter compares model families — how to choose between GPT, Claude, Gemini, Llama, and others when context strategy, cost, and quality requirements differ by use case.
The real PM lesson
A bigger window is permission to design better — not permission to stop designing.