Introduction
In Chapter 2, we understood how transformers became Large Language Models and why LLMs should be seen as prediction systems, not databases.
This chapter goes one level deeper into something every AI product manager must understand: tokens and context windows.
At first, this may sound like a technical backend topic. It is not.
Tokens and context windows directly influence:
- how much information the model can read,
- how much it can remember during a task,
- how much an AI call costs,
- how reliable the answer is,
- how long an AI workflow can run,
- how well an AI agent can use tools,
- and whether the final experience feels intelligent or confused.
A lot of AI products fail not because the model is weak, but because the product passes the wrong context, too much context, stale context, or no context at all.
The simple PM version
Context is not storage. Context is working memory.
And working memory must be designed.
1. What Is a Token?
A token is the unit of text that an LLM reads and generates.
People casually say that models read "words," but that is not technically accurate. Models process tokens.
| Input | Possible Token Meaning |
|---|---|
hospital | A full word |
authorization | May be split into smaller pieces |
₹50,000 | Currency symbol, number, comma, or grouped chunks |
function getClaim() | Code tokens |
Pre-Auth | Word fragments and punctuation |
. | Punctuation |
2026 | Number token |
The exact tokenization depends on the model's tokenizer, but the product implication is simple:
PM translation
The model does not see your content exactly like a human sees paragraphs, pages, and documents. It sees a sequence of tokens — and that sequence has limits.
2. Why Tokens Matter for Product Managers
Tokens are not just a technical billing detail. They are a product constraint.
Every AI feature has a token economy.
When a user asks a question, the model may receive:
| Context Component | Example |
|---|---|
| System prompt | "You are a claim adjudication assistant." |
| User message | "Review this case." |
| Conversation history | Previous user and assistant turns |
| Retrieved documents | Policy wording, SOP, claim documents |
| Tool definitions | Available APIs or tools the model can call |
| Tool results | Database lookup, OCR output, search result |
| Instructions | Output format, refusal rules, tone |
| Current response | Tokens generated by the model |
All of this consumes the available context window.
The model does not only process the user's visible question. It processes the full package sent to it — hidden instructions, prior conversation, retrieved files, tool schemas, API results, and expected output space.
PM Takeaway
When you design an AI feature, you are not only designing the UI. You are deciding:
| PM Decision | Token Impact |
|---|---|
| How much chat history to retain | More history = more tokens |
| How many documents to retrieve | More documents = more tokens |
| How detailed the system prompt should be | More instruction = more tokens |
| Whether to include full files or summaries | Full files consume more context |
| Whether to use tools | Tool definitions and outputs consume tokens |
| Whether to support long workflows | Long workflows need context strategy |
PM takeaway
A PM who does not understand tokens will design AI features that look good in demos and fail in production.
3. What Is a Context Window?
The context window is the total amount of information the model can consider while generating a response.
Think of it as the model's working desk. You can place documents, instructions, user queries, examples, tool results, and prior conversation on the desk. The model can work with what is on the desk.
But if something is not on the desk, the model cannot reliably use it.
The context window is different from the model's training data.
| Concept | Meaning |
|---|---|
| Training data | The broad data the model learned from during training |
| Context window | The current information available for this specific request |
This distinction is critical. The model may have general knowledge from training, but for a specific business task, you need to provide the right working context.
Simple Analogy
| Human Workflow | LLM Workflow |
|---|---|
| A person has long-term memory | Model has training data |
| A person opens files on a desk | Model receives context |
| A person can only focus on visible material | Model can only attend to tokens in context |
| A person may forget earlier conversation | Model may lose access when context is removed |
| A person performs better with organized notes | Model performs better with curated context |
PM takeaway
The context window should not be treated casually. It is one of the most important design surfaces in an AI product.
4. Context Window Is Not the Same as Memory
This is one of the most common misunderstandings in AI product discussions.
People say: "The model should remember this." But what do they mean by remember?
There are different types of memory in AI products.
| Type | Meaning | Example |
|---|---|---|
| Training memory | Patterns learned during model training | General language, coding patterns, public facts |
| Context window | Current working memory | Current prompt, documents, chat history |
| Product memory | Stored user or project data | CRM notes, preferences, prior tasks |
| Agent state | Workflow state across steps | Current task, completed actions, next action |
| Long-term memory | Persistent external memory | Vector DB, database, file store |
The context window is only one layer. It is temporary. Once the conversation changes, gets summarized, or exceeds the limit, earlier information may be dropped, compressed, or no longer visible to the model.
PM Takeaway
Do not use the context window as your product's long-term memory. Use proper storage.
| Need | Better Design |
|---|---|
| Remember user preference | Store in profile/database |
| Remember past claim decision | Store in system of record |
| Remember prior chat summary | Store conversation summary |
| Remember project knowledge | Use RAG/document index |
| Remember agent progress | Store workflow state |
| Remember audit trail | Store structured logs |
Mental model
A context window is for active reasoning. It is not permanent memory.
5. How Context Builds Up During a Conversation
In a multi-turn conversation, earlier user messages and assistant responses are usually included again in the next request. That means context grows as the conversation continues.
| Turn | What Gets Sent |
|---|---|
| Turn 1 | System prompt + user message |
| Turn 2 | System prompt + Turn 1 user + Turn 1 assistant + new user message |
| Turn 3 | System prompt + earlier history + new user message |
| Turn 10 | A lot of accumulated conversation |
This is fine for short chats. It becomes risky for long-running workflows.
Why? Because the model may be forced to process:
- old instructions,
- repeated details,
- irrelevant discussion,
- long tool outputs,
- large documents,
- intermediate reasoning,
- duplicated content,
- stale context.
At some point, the model's working memory gets cluttered.
PM takeaway
The issue is not only hitting the maximum limit. Performance can degrade before the limit — when context gets crowded.
6. Bigger Context Is Useful, But Not Automatically Better
A larger context window allows the model to handle longer prompts, larger documents, bigger codebases, and more complex workflows. That sounds great.
But bigger context is not automatically better.
A large context window gives capacity. It does not automatically create clarity.
PM Analogy
A large desk helps if the right documents are placed neatly. But if the desk is filled with random files, duplicate printouts, old notes, irrelevant emails, and half-read PDFs, the person working on the task may actually perform worse.
Same with LLMs.
| Bad Product Thinking | Better Product Thinking |
|---|---|
| "The model supports a huge context window. Let us send everything." | "The model supports a large context window. Let us decide what deserves to enter context, in what order, and in what format." |
That is context engineering
Capacity alone does not create a reliable product. Curation does.
7. Context Rot: The Hidden Product Risk
As token count grows, model performance may degrade.
Anthropic refers to this type of degradation as context rot.
In simple language
The more cluttered the context becomes, the harder it can become for the model to reliably find and use the right information.
In AI products, context rot may look like:
| Symptom | What User Sees |
|---|---|
| Missed instruction | Model ignores an important rule |
| Lost fact | Model forgets a document detail |
| Weak retrieval | Model cites irrelevant context |
| Contradiction | Model says something inconsistent |
| Shallow answer | Model summarizes instead of reasoning |
| Bad tool choice | Agent calls the wrong tool |
| Workflow drift | Agent moves away from the original task |
For PMs, context rot explains why long AI workflows can become unreliable even when they technically fit inside the context window.
Example: Claims Review
Suppose you pass:
- full policy document,
- hospital bill,
- discharge summary,
- previous authorization letter,
- email trail,
- insurer guideline,
- claim notes,
- OCR output,
- user instructions,
- processor comments.
The model may have enough token capacity to receive all of it. But will it focus on the right sections? Not automatically.
PM takeaway
You still need retrieval, prioritization, structure, and clear instructions.
8. Context Engineering: The New PM Skill
Prompt engineering gets attention, but context engineering is more important for serious AI products.
Prompt engineering asks: What instruction should we give the model?
Context engineering asks: What information should the model see before it answers?
That second question is often more important.
Context Engineering Includes
| Area | PM Question |
|---|---|
| Context selection | What information should be included? |
| Context exclusion | What should be left out? |
| Ordering | What should appear first or last? |
| Compression | What can be summarized? |
| Freshness | Is this the latest source? |
| Authority | Which source should override others? |
| Relevance | Is this needed for this task? |
| Format | Should this be table, JSON, bullet, or text? |
| Lifecycle | When should old context be removed? |
A PM does not need to implement tokenization logic. But a PM must define the product rules for context.
Example
Weak context design: Send all claim documents to the model.
Better context design: Send the claim type, policy summary, relevant policy clauses, latest bill extract, discharge diagnosis, previous authorization status, missing document checklist, and user's current question. Exclude unrelated email trail unless the user asks for communication history.
PM takeaway
This is the difference between a demo and a product.
9. Token Budgeting: How PMs Should Think
A token budget is a planning tool. It answers: how much of the context window should be reserved for each part of the task?
A simple token budget may look like this:
| Context Component | Budget Priority |
|---|---|
| System instruction | Must include |
| User request | Must include |
| Relevant retrieved documents | Must include |
| Tool definitions | Include only required tools |
| Tool results | Include only useful results |
| Conversation history | Summarize when long |
| Output space | Reserve enough for final answer |
| Reasoning budget | Reserve if using extended reasoning |
A common mistake is filling the input context so aggressively that the model has too little room to produce a useful answer.
Remember
The context window includes both what the model reads and what it generates.
PM Takeaway
When defining an AI feature, ask:
| Question | Why It Matters |
|---|---|
| How much input context is needed? | Controls grounding |
| How much output is expected? | Controls user experience |
| How much history is useful? | Controls continuity |
| How much tool output should remain? | Controls agent reliability |
| What happens when budget is exceeded? | Controls failure behavior |
Token budgeting should be part of AI product requirements.
10. Token Counting: Measurement Before Execution
Token counting is the discipline of estimating how large your prompt is before sending it to the model.
Token count affects:
- cost,
- latency,
- rate limits,
- model routing,
- context overflow,
- prompt optimization.
For a product team, token counting should not be treated as a developer-only diagnostic. It should support product decisions.
| Product Scenario | Token Counting Use |
|---|---|
| User uploads 100-page PDF | Check whether it fits before processing |
| Agent uses many tools | Estimate tool-definition overhead |
| Chat has 50 turns | Decide when to summarize |
| Enterprise workflow has cost cap | Route to cheaper model when possible |
| User asks for detailed report | Reserve output space |
| Mobile AI feature | Optimize for latency |
If your product has no token visibility, you are flying blind.
Further reading
For tokenization and context window fundamentals, see Anthropic's context window documentation.
11. Context Overflow: What Happens When You Run Out of Room?
Context overflow happens when the input plus expected output exceeds what the model can handle. In product terms, this means the model has run out of working memory.
Bad experience: "Error: context length exceeded."
Better experience: "This document set is too large to review in one pass. I'll summarize the older documents first, then analyze the most relevant sections."
Product Fallback Options
| Problem | Product Response |
|---|---|
| Too many documents | Ask user to select priority documents |
| Long conversation | Summarize earlier discussion |
| Large tool output | Keep only important fields |
| Too many images/PDF pages | Split into batches |
| Long report requested | Generate section by section |
| Agent workflow too long | Save state and start new context |
PM takeaway
This is where AI product maturity shows. A serious AI product should not collapse when context is full — it should recover gracefully.
12. Context Management Strategies
For long-running conversations and agentic workflows, context has to be actively managed. There are three important product patterns.
Pattern 1: Summarize
Use when the conversation is long but the full detail is no longer needed.
"Summarize the first 30 turns into user goals, decisions made, unresolved questions, and next actions."
Pattern 2: Select
Use when only some documents are relevant.
Retrieve only the policy clauses related to maternity waiting period instead of sending the full policy booklet.
Pattern 3: Clear
Use when old tool outputs are no longer useful.
Remove raw API responses after extracting the final structured values.
PM Takeaway
Context management should be designed as a lifecycle.
| Stage | Context Action |
|---|---|
| Start of task | Load task instructions and key data |
| During task | Add tool results and user clarifications |
| Midway | Summarize or compact old context |
| After tool use | Clear irrelevant raw outputs |
| Before final answer | Keep only evidence and decision logic |
| After task | Store summary and audit trail externally |
Do not let context grow randomly
Random context creates random behavior.
13. Extended Thinking and Context
Some models support extended thinking, where the model can spend more tokens on internal reasoning before giving a final answer.
For PMs
Reasoning also consumes budget. If you ask a model to think deeply, you must reserve room for that thinking and for the final answer.
Where Extended Thinking Helps
| Use Case | Why It Helps |
|---|---|
| Complex claim adjudication | Multiple policy checks and deductions |
| Codebase analysis | Need to trace dependencies |
| Legal clause comparison | Need careful interpretation |
| Financial reconciliation | Multiple calculations and exceptions |
| Strategic planning | Needs trade-off analysis |
Where It May Be Unnecessary
| Use Case | Why Not |
|---|---|
| Simple rewrite | Low reasoning need |
| Basic classification | Short task |
| FAQ answer | RAG may be enough |
| Short summary | Fast response preferred |
PM takeaway
Do not enable deep reasoning everywhere. Use it where accuracy and complexity justify cost and latency.
14. Tool Use Adds More Context Pressure
AI agents often use tools. Tools are powerful, but they add context pressure.
A tool-using workflow may include:
- tool definitions,
- tool call request,
- tool result,
- assistant interpretation,
- follow-up tool call,
- additional result,
- final response.
Every tool call creates context overhead.
If your AI agent calls too many tools, keeps too many raw results, or carries unnecessary history, the workflow becomes expensive and fragile.
PM Questions for Tool-Based Products
| Question | Why It Matters |
|---|---|
| Which tools should be available for this task? | Too many tools confuse and increase overhead |
| What fields should each tool return? | Raw payloads may waste context |
| Should tool results be summarized? | Reduces clutter |
| When should tool results be cleared? | Prevents context rot |
| Which actions need approval? | Controls risk |
| What should be logged outside context? | Supports audit |
PM takeaway
Tool use is not just engineering. It is product behavior design.
15. Context Awareness: Models Knowing Their Remaining Budget
Some newer models can be made aware of their remaining context budget during long workflows. This matters because a model that knows its remaining budget can behave more intelligently.
It can decide:
- whether to be concise,
- whether to summarize,
- whether to continue,
- whether to ask for prioritization,
- whether to avoid unnecessary detail,
- whether to save state before context runs out.
PM Takeaway
Future AI products will not only manage user journeys. They will manage context journeys.
For long-running workflows, your system should know:
| State | Product Behavior |
|---|---|
| Plenty of context left | Continue normal workflow |
| Moderate context left | Start being selective |
| Low context left | Summarize and compact |
| Near limit | Save state and restart |
| Exceeded limit | Recover gracefully |
Agentic products
An AI agent without context management is like an employee working on a complex case while their notes keep disappearing.
16. Long Context vs RAG: Which Should a PM Choose?
A common PM question is: If the model has a huge context window, do we still need RAG?
Yes, often you do.
Long context and RAG solve related but different problems.
| Approach | Strength | Weakness |
|---|---|---|
| Long context | Can process a large amount of material together | Expensive, slower, risk of context rot |
| RAG | Retrieves focused, relevant information | Depends on retrieval quality |
| Summarization | Reduces volume | May lose details |
| Tool use | Gets exact current data | Adds workflow complexity |
| Memory store | Preserves long-term state | Needs retrieval and governance |
Product Rule of Thumb
- Use long context when the model genuinely needs to reason across many pieces together.
- Use RAG when the model needs the most relevant pieces from a larger knowledge base.
- Use both when the task needs broad awareness and precise grounding.
Examples
| Use Case | Better Approach |
|---|---|
| Summarize one long contract | Long context |
| Answer policy question from 10,000 documents | RAG |
| Compare 20 uploaded claim documents | Long context + structured extraction |
| Customer support from knowledge base | RAG |
| Codebase migration planning | Long context + repo graph + retrieval |
| Agent running multi-step workflow | RAG + tools + compaction |
PM takeaway
A large context window does not remove the need for product architecture.
17. How Poor Context Design Creates Hallucination
Hallucination is not only caused by weak models. It is often caused by poor context design.
| Poor Context Design | Likely Failure |
|---|---|
| Missing source document | Model guesses |
| Too many irrelevant documents | Model focuses on wrong detail |
| Old policy mixed with new policy | Model gives outdated answer |
| No source priority | Model treats weak source as strong |
| Raw OCR errors included | Model trusts bad extraction |
| Long chat history retained | Model follows stale instruction |
| Tool output too large | Model misses key fields |
PM Takeaway
When an AI answer is wrong, do not only blame the model. Ask:
- Was the right context retrieved?
- Was irrelevant context removed?
- Was the source current?
- Was the context structured?
- Was the model told which source to trust?
- Was there enough output budget?
- Was the task too broad?
PM takeaway
Many AI failures are context failures.
18. Context Design for Enterprise Workflows
Enterprise AI products need stricter context design than consumer chatbots. A consumer can tolerate a slightly vague answer. A business workflow often cannot.
Example: Pre-Auth Claim Review
A good context package may include:
| Context Item | Include? | Reason |
|---|---|---|
| Current user question | Yes | Defines task |
| Policy number and product type | Yes | Defines coverage context |
| Admission date | Yes | Needed for eligibility |
| Diagnosis and procedure | Yes | Needed for medical decision |
| Relevant policy clauses | Yes | Grounding |
| Full policy booklet | Maybe | Only if targeted clauses are not enough |
| Hospital bill line items | Yes | Needed for deductions |
| Previous authorization history | Yes | Avoids duplicate decisioning |
| Raw email trail | Maybe | Only if communication issue exists |
| Old irrelevant claims | No | Adds noise |
| Internal notes | Yes, if relevant | Operational context |
| Tool logs | No, unless needed | Avoid clutter |
How PMs should think
Not "send everything." Not "send only the user query." Send the right context.
19. A PM Checklist for Tokens and Context Windows
Before shipping an AI feature, ask:
| Question | Why It Matters |
|---|---|
| What is the maximum expected input size? | Avoid overflow |
| What is the expected output size? | Reserve generation budget |
| What documents enter context? | Control grounding |
| Who decides relevance? | Product or retrieval logic |
| What gets summarized? | Manage long sessions |
| What gets deleted from context? | Prevent clutter |
| How are old instructions handled? | Avoid conflict |
| How are tool results handled? | Prevent context bloat |
| What happens when limit is near? | Graceful fallback |
| How is token cost monitored? | Commercial viability |
| How is latency monitored? | User experience |
| How is context quality evaluated? | Accuracy and trust |
Add these questions to your AI PRDs. Tokens and context should not be afterthoughts.
20. Final Mental Model
Tokens are the units the model reads and writes.
The context window is the model's working memory.
A large context window gives capacity, but not automatic accuracy.
More context can help, but bad context can hurt.
For product managers, the real skill is not simply asking: "How many tokens does the model support?"
The better question is: "What context does the model need to complete this task reliably?"
That is the shift. AI product quality depends on context quality.
The best AI products will not dump everything into the model. They will curate, compress, retrieve, prioritize, clear, and preserve context deliberately.
The real PM version
Tokens and context windows are not just engineering details. They are product architecture.
Chapter Summary
| Concept | PM Understanding |
|---|---|
| Token | The unit of text the model processes. |
| Context window | The model's working memory for the current request. |
| Training data vs context | Training data is learned earlier; context is what the model sees now. |
| Token budget | Planning how much room each part of the task gets. |
| Token counting | Estimating prompt size before execution. |
| Context rot | Accuracy and recall can degrade as context grows. |
| Long context | Useful capacity, but not a replacement for context design. |
| RAG | Retrieves relevant external knowledge into context. |
| Compaction | Summarizes older context so long workflows can continue. |
| Context editing | Removes unnecessary old tool results or blocks. |
| Extended thinking | Uses additional output tokens for deeper reasoning. |
| Tool use | Adds power but also context overhead. |
| Context awareness | Model capability to track remaining token budget. |
| PM role | Design what enters context, what stays out, and what happens near limits. |
Closing Thought
In traditional software, product managers designed screens, flows, fields, and rules. In AI products, product managers must also design context.
That means deciding what the model sees, what it ignores, what it retrieves, what it remembers, what it forgets, and what it should do when the working memory gets crowded.
This is not a small technical detail. This is one of the foundations of reliable AI product design.
A model with a large context window can still fail if the product gives it the wrong context. A smaller model with clean, relevant, well-structured context can often outperform a larger model drowning in noise.
The real PM lesson
Better context beats bigger context.
Chapter navigation
Chapter 2: From Transformers to LLMs — The PM Version
How Transformer architecture became the foundation of modern AI products.
Read chapter → Next →Chapter 4: AI Safety, RLHF, and Constitutional AI
Why safety is product architecture — not just policy — and how RLHF and Constitutional AI shape behavior.
Read chapter →