Tokens and Context Windows — The PM Version

Introduction

In Chapter 2, we understood how transformers became Large Language Models and why LLMs should be seen as prediction systems, not databases.

This chapter goes one level deeper into something every AI product manager must understand: tokens and context windows.

At first, this may sound like a technical backend topic. It is not.

Tokens and context windows directly influence:

how much information the model can read,
how much it can remember during a task,
how much an AI call costs,
how reliable the answer is,
how long an AI workflow can run,
how well an AI agent can use tools,
and whether the final experience feels intelligent or confused.

A lot of AI products fail not because the model is weak, but because the product passes the wrong context, too much context, stale context, or no context at all.

The simple PM version

Context is not storage. Context is working memory.
And working memory must be designed.

1. What Is a Token?

A token is the unit of text that an LLM reads and generates.

People casually say that models read "words," but that is not technically accurate. Models process tokens.

Input	Possible Token Meaning
`hospital`	A full word
`authorization`	May be split into smaller pieces
`₹50,000`	Currency symbol, number, comma, or grouped chunks
`function getClaim()`	Code tokens
`Pre-Auth`	Word fragments and punctuation
`.`	Punctuation
`2026`	Number token

The exact tokenization depends on the model's tokenizer, but the product implication is simple:

PM translation

The model does not see your content exactly like a human sees paragraphs, pages, and documents. It sees a sequence of tokens — and that sequence has limits.

2. Why Tokens Matter for Product Managers

Tokens are not just a technical billing detail. They are a product constraint.

Every AI feature has a token economy.

When a user asks a question, the model may receive:

Context Component	Example
System prompt	"You are a claim adjudication assistant."
User message	"Review this case."
Conversation history	Previous user and assistant turns
Retrieved documents	Policy wording, SOP, claim documents
Tool definitions	Available APIs or tools the model can call
Tool results	Database lookup, OCR output, search result
Instructions	Output format, refusal rules, tone
Current response	Tokens generated by the model

All of this consumes the available context window.

The model does not only process the user's visible question. It processes the full package sent to it — hidden instructions, prior conversation, retrieved files, tool schemas, API results, and expected output space.

PM Takeaway

When you design an AI feature, you are not only designing the UI. You are deciding:

PM Decision	Token Impact
How much chat history to retain	More history = more tokens
How many documents to retrieve	More documents = more tokens
How detailed the system prompt should be	More instruction = more tokens
Whether to include full files or summaries	Full files consume more context
Whether to use tools	Tool definitions and outputs consume tokens
Whether to support long workflows	Long workflows need context strategy

PM takeaway

A PM who does not understand tokens will design AI features that look good in demos and fail in production.

3. What Is a Context Window?

The context window is the total amount of information the model can consider while generating a response.

Think of it as the model's working desk. You can place documents, instructions, user queries, examples, tool results, and prior conversation on the desk. The model can work with what is on the desk.

But if something is not on the desk, the model cannot reliably use it.

The context window is different from the model's training data.

Concept	Meaning
Training data	The broad data the model learned from during training
Context window	The current information available for this specific request

This distinction is critical. The model may have general knowledge from training, but for a specific business task, you need to provide the right working context.

Simple Analogy

Human Workflow	LLM Workflow
A person has long-term memory	Model has training data
A person opens files on a desk	Model receives context
A person can only focus on visible material	Model can only attend to tokens in context
A person may forget earlier conversation	Model may lose access when context is removed
A person performs better with organized notes	Model performs better with curated context

PM takeaway

The context window should not be treated casually. It is one of the most important design surfaces in an AI product.

4. Context Window Is Not the Same as Memory

This is one of the most common misunderstandings in AI product discussions.

People say: "The model should remember this." But what do they mean by remember?

There are different types of memory in AI products.

Type	Meaning	Example
Training memory	Patterns learned during model training	General language, coding patterns, public facts
Context window	Current working memory	Current prompt, documents, chat history
Product memory	Stored user or project data	CRM notes, preferences, prior tasks
Agent state	Workflow state across steps	Current task, completed actions, next action
Long-term memory	Persistent external memory	Vector DB, database, file store

The context window is only one layer. It is temporary. Once the conversation changes, gets summarized, or exceeds the limit, earlier information may be dropped, compressed, or no longer visible to the model.

PM Takeaway

Do not use the context window as your product's long-term memory. Use proper storage.

Need	Better Design
Remember user preference	Store in profile/database
Remember past claim decision	Store in system of record
Remember prior chat summary	Store conversation summary
Remember project knowledge	Use RAG/document index
Remember agent progress	Store workflow state
Remember audit trail	Store structured logs

Mental model

A context window is for active reasoning. It is not permanent memory.

5. How Context Builds Up During a Conversation

In a multi-turn conversation, earlier user messages and assistant responses are usually included again in the next request. That means context grows as the conversation continues.

Turn	What Gets Sent
Turn 1	System prompt + user message
Turn 2	System prompt + Turn 1 user + Turn 1 assistant + new user message
Turn 3	System prompt + earlier history + new user message
Turn 10	A lot of accumulated conversation

This is fine for short chats. It becomes risky for long-running workflows.

Why? Because the model may be forced to process:

old instructions,
repeated details,
irrelevant discussion,
long tool outputs,
large documents,
intermediate reasoning,
duplicated content,
stale context.

At some point, the model's working memory gets cluttered.

PM takeaway

The issue is not only hitting the maximum limit. Performance can degrade before the limit — when context gets crowded.

6. Bigger Context Is Useful, But Not Automatically Better

A larger context window allows the model to handle longer prompts, larger documents, bigger codebases, and more complex workflows. That sounds great.

But bigger context is not automatically better.

A large context window gives capacity. It does not automatically create clarity.

PM Analogy

A large desk helps if the right documents are placed neatly. But if the desk is filled with random files, duplicate printouts, old notes, irrelevant emails, and half-read PDFs, the person working on the task may actually perform worse.

Same with LLMs.

Bad Product Thinking	Better Product Thinking
"The model supports a huge context window. Let us send everything."	"The model supports a large context window. Let us decide what deserves to enter context, in what order, and in what format."

That is context engineering

Capacity alone does not create a reliable product. Curation does.

7. Context Rot: The Hidden Product Risk

As token count grows, model performance may degrade.

Anthropic refers to this type of degradation as context rot.

In simple language

The more cluttered the context becomes, the harder it can become for the model to reliably find and use the right information.

In AI products, context rot may look like:

Symptom	What User Sees
Missed instruction	Model ignores an important rule
Lost fact	Model forgets a document detail
Weak retrieval	Model cites irrelevant context
Contradiction	Model says something inconsistent
Shallow answer	Model summarizes instead of reasoning
Bad tool choice	Agent calls the wrong tool
Workflow drift	Agent moves away from the original task

For PMs, context rot explains why long AI workflows can become unreliable even when they technically fit inside the context window.

Example: Claims Review

Suppose you pass:

full policy document,
hospital bill,
discharge summary,
previous authorization letter,
email trail,
insurer guideline,
claim notes,
OCR output,
user instructions,
processor comments.

The model may have enough token capacity to receive all of it. But will it focus on the right sections? Not automatically.

PM takeaway

You still need retrieval, prioritization, structure, and clear instructions.

8. Context Engineering: The New PM Skill

Prompt engineering gets attention, but context engineering is more important for serious AI products.

Prompt engineering asks: What instruction should we give the model?

Context engineering asks: What information should the model see before it answers?

That second question is often more important.

Context Engineering Includes

Area	PM Question
Context selection	What information should be included?
Context exclusion	What should be left out?
Ordering	What should appear first or last?
Compression	What can be summarized?
Freshness	Is this the latest source?
Authority	Which source should override others?
Relevance	Is this needed for this task?
Format	Should this be table, JSON, bullet, or text?
Lifecycle	When should old context be removed?

A PM does not need to implement tokenization logic. But a PM must define the product rules for context.

Example

Weak context design: Send all claim documents to the model.

Better context design: Send the claim type, policy summary, relevant policy clauses, latest bill extract, discharge diagnosis, previous authorization status, missing document checklist, and user's current question. Exclude unrelated email trail unless the user asks for communication history.

PM takeaway

This is the difference between a demo and a product.

9. Token Budgeting: How PMs Should Think

A token budget is a planning tool. It answers: how much of the context window should be reserved for each part of the task?

A simple token budget may look like this:

Context Component	Budget Priority
System instruction	Must include
User request	Must include
Relevant retrieved documents	Must include
Tool definitions	Include only required tools
Tool results	Include only useful results
Conversation history	Summarize when long
Output space	Reserve enough for final answer
Reasoning budget	Reserve if using extended reasoning

A common mistake is filling the input context so aggressively that the model has too little room to produce a useful answer.

Remember

The context window includes both what the model reads and what it generates.

PM Takeaway

When defining an AI feature, ask:

Question	Why It Matters
How much input context is needed?	Controls grounding
How much output is expected?	Controls user experience
How much history is useful?	Controls continuity
How much tool output should remain?	Controls agent reliability
What happens when budget is exceeded?	Controls failure behavior

Token budgeting should be part of AI product requirements.

10. Token Counting: Measurement Before Execution

Token counting is the discipline of estimating how large your prompt is before sending it to the model.

Token count affects:

cost,
latency,
rate limits,
model routing,
context overflow,
prompt optimization.

For a product team, token counting should not be treated as a developer-only diagnostic. It should support product decisions.

Product Scenario	Token Counting Use
User uploads 100-page PDF	Check whether it fits before processing
Agent uses many tools	Estimate tool-definition overhead
Chat has 50 turns	Decide when to summarize
Enterprise workflow has cost cap	Route to cheaper model when possible
User asks for detailed report	Reserve output space
Mobile AI feature	Optimize for latency

If your product has no token visibility, you are flying blind.

11. Context Overflow: What Happens When You Run Out of Room?

Context overflow happens when the input plus expected output exceeds what the model can handle. In product terms, this means the model has run out of working memory.

Bad experience: "Error: context length exceeded."

Better experience: "This document set is too large to review in one pass. I'll summarize the older documents first, then analyze the most relevant sections."

Product Fallback Options

Problem	Product Response
Too many documents	Ask user to select priority documents
Long conversation	Summarize earlier discussion
Large tool output	Keep only important fields
Too many images/PDF pages	Split into batches
Long report requested	Generate section by section
Agent workflow too long	Save state and start new context

PM takeaway

This is where AI product maturity shows. A serious AI product should not collapse when context is full — it should recover gracefully.

12. Context Management Strategies

For long-running conversations and agentic workflows, context has to be actively managed. There are three important product patterns.

Pattern 1: Summarize

Use when the conversation is long but the full detail is no longer needed.

"Summarize the first 30 turns into user goals, decisions made, unresolved questions, and next actions."

Pattern 2: Select

Use when only some documents are relevant.

Retrieve only the policy clauses related to maternity waiting period instead of sending the full policy booklet.

Pattern 3: Clear

Use when old tool outputs are no longer useful.

Remove raw API responses after extracting the final structured values.

PM Takeaway

Context management should be designed as a lifecycle.

Stage	Context Action
Start of task	Load task instructions and key data
During task	Add tool results and user clarifications
Midway	Summarize or compact old context
After tool use	Clear irrelevant raw outputs
Before final answer	Keep only evidence and decision logic
After task	Store summary and audit trail externally

Do not let context grow randomly

Random context creates random behavior.

13. Extended Thinking and Context

Some models support extended thinking, where the model can spend more tokens on internal reasoning before giving a final answer.

For PMs

Reasoning also consumes budget. If you ask a model to think deeply, you must reserve room for that thinking and for the final answer.

Where Extended Thinking Helps

Use Case	Why It Helps
Complex claim adjudication	Multiple policy checks and deductions
Codebase analysis	Need to trace dependencies
Legal clause comparison	Need careful interpretation
Financial reconciliation	Multiple calculations and exceptions
Strategic planning	Needs trade-off analysis

Where It May Be Unnecessary

Use Case	Why Not
Simple rewrite	Low reasoning need
Basic classification	Short task
FAQ answer	RAG may be enough
Short summary	Fast response preferred

PM takeaway

Do not enable deep reasoning everywhere. Use it where accuracy and complexity justify cost and latency.

14. Tool Use Adds More Context Pressure

AI agents often use tools. Tools are powerful, but they add context pressure.

A tool-using workflow may include:

tool definitions,
tool call request,
tool result,
assistant interpretation,
follow-up tool call,
additional result,
final response.

Every tool call creates context overhead.

If your AI agent calls too many tools, keeps too many raw results, or carries unnecessary history, the workflow becomes expensive and fragile.

PM Questions for Tool-Based Products

Question	Why It Matters
Which tools should be available for this task?	Too many tools confuse and increase overhead
What fields should each tool return?	Raw payloads may waste context
Should tool results be summarized?	Reduces clutter
When should tool results be cleared?	Prevents context rot
Which actions need approval?	Controls risk
What should be logged outside context?	Supports audit

PM takeaway

Tool use is not just engineering. It is product behavior design.

15. Context Awareness: Models Knowing Their Remaining Budget

Some newer models can be made aware of their remaining context budget during long workflows. This matters because a model that knows its remaining budget can behave more intelligently.

It can decide:

whether to be concise,
whether to summarize,
whether to continue,
whether to ask for prioritization,
whether to avoid unnecessary detail,
whether to save state before context runs out.

PM Takeaway

Future AI products will not only manage user journeys. They will manage context journeys.

For long-running workflows, your system should know:

State	Product Behavior
Plenty of context left	Continue normal workflow
Moderate context left	Start being selective
Low context left	Summarize and compact
Near limit	Save state and restart
Exceeded limit	Recover gracefully

Agentic products

An AI agent without context management is like an employee working on a complex case while their notes keep disappearing.

16. Long Context vs RAG: Which Should a PM Choose?

A common PM question is: If the model has a huge context window, do we still need RAG?

Yes, often you do.

Long context and RAG solve related but different problems.

Approach	Strength	Weakness
Long context	Can process a large amount of material together	Expensive, slower, risk of context rot
RAG	Retrieves focused, relevant information	Depends on retrieval quality
Summarization	Reduces volume	May lose details
Tool use	Gets exact current data	Adds workflow complexity
Memory store	Preserves long-term state	Needs retrieval and governance

Product Rule of Thumb

Use long context when the model genuinely needs to reason across many pieces together.
Use RAG when the model needs the most relevant pieces from a larger knowledge base.
Use both when the task needs broad awareness and precise grounding.

Examples

Use Case	Better Approach
Summarize one long contract	Long context
Answer policy question from 10,000 documents	RAG
Compare 20 uploaded claim documents	Long context + structured extraction
Customer support from knowledge base	RAG
Codebase migration planning	Long context + repo graph + retrieval
Agent running multi-step workflow	RAG + tools + compaction

PM takeaway

A large context window does not remove the need for product architecture.

17. How Poor Context Design Creates Hallucination

Hallucination is not only caused by weak models. It is often caused by poor context design.

Poor Context Design	Likely Failure
Missing source document	Model guesses
Too many irrelevant documents	Model focuses on wrong detail
Old policy mixed with new policy	Model gives outdated answer
No source priority	Model treats weak source as strong
Raw OCR errors included	Model trusts bad extraction
Long chat history retained	Model follows stale instruction
Tool output too large	Model misses key fields

PM Takeaway

When an AI answer is wrong, do not only blame the model. Ask:

Was the right context retrieved?
Was irrelevant context removed?
Was the source current?
Was the context structured?
Was the model told which source to trust?
Was there enough output budget?
Was the task too broad?

PM takeaway

Many AI failures are context failures.

18. Context Design for Enterprise Workflows

Enterprise AI products need stricter context design than consumer chatbots. A consumer can tolerate a slightly vague answer. A business workflow often cannot.

Example: Pre-Auth Claim Review

A good context package may include:

Context Item	Include?	Reason
Current user question	Yes	Defines task
Policy number and product type	Yes	Defines coverage context
Admission date	Yes	Needed for eligibility
Diagnosis and procedure	Yes	Needed for medical decision
Relevant policy clauses	Yes	Grounding
Full policy booklet	Maybe	Only if targeted clauses are not enough
Hospital bill line items	Yes	Needed for deductions
Previous authorization history	Yes	Avoids duplicate decisioning
Raw email trail	Maybe	Only if communication issue exists
Old irrelevant claims	No	Adds noise
Internal notes	Yes, if relevant	Operational context
Tool logs	No, unless needed	Avoid clutter

How PMs should think

Not "send everything." Not "send only the user query." Send the right context.

19. A PM Checklist for Tokens and Context Windows

Before shipping an AI feature, ask:

Question	Why It Matters
What is the maximum expected input size?	Avoid overflow
What is the expected output size?	Reserve generation budget
What documents enter context?	Control grounding
Who decides relevance?	Product or retrieval logic
What gets summarized?	Manage long sessions
What gets deleted from context?	Prevent clutter
How are old instructions handled?	Avoid conflict
How are tool results handled?	Prevent context bloat
What happens when limit is near?	Graceful fallback
How is token cost monitored?	Commercial viability
How is latency monitored?	User experience
How is context quality evaluated?	Accuracy and trust

Add these questions to your AI PRDs. Tokens and context should not be afterthoughts.

20. Final Mental Model

Tokens are the units the model reads and writes.
The context window is the model's working memory.
A large context window gives capacity, but not automatic accuracy.
More context can help, but bad context can hurt.

For product managers, the real skill is not simply asking: "How many tokens does the model support?"

The better question is: "What context does the model need to complete this task reliably?"

That is the shift. AI product quality depends on context quality.

The best AI products will not dump everything into the model. They will curate, compress, retrieve, prioritize, clear, and preserve context deliberately.

The real PM version

Tokens and context windows are not just engineering details. They are product architecture.

Chapter Summary

Concept	PM Understanding
Token	The unit of text the model processes.
Context window	The model's working memory for the current request.
Training data vs context	Training data is learned earlier; context is what the model sees now.
Token budget	Planning how much room each part of the task gets.
Token counting	Estimating prompt size before execution.
Context rot	Accuracy and recall can degrade as context grows.
Long context	Useful capacity, but not a replacement for context design.
RAG	Retrieves relevant external knowledge into context.
Compaction	Summarizes older context so long workflows can continue.
Context editing	Removes unnecessary old tool results or blocks.
Extended thinking	Uses additional output tokens for deeper reasoning.
Tool use	Adds power but also context overhead.
Context awareness	Model capability to track remaining token budget.
PM role	Design what enters context, what stays out, and what happens near limits.

Closing Thought

In traditional software, product managers designed screens, flows, fields, and rules. In AI products, product managers must also design context.

That means deciding what the model sees, what it ignores, what it retrieves, what it remembers, what it forgets, and what it should do when the working memory gets crowded.

This is not a small technical detail. This is one of the foundations of reliable AI product design.

A model with a large context window can still fail if the product gives it the wrong context. A smaller model with clean, relevant, well-structured context can often outperform a larger model drowning in noise.

The real PM lesson

Better context beats bigger context.

Chapter navigation

← Previous

Chapter 2: From Transformers to LLMs — The PM Version

How Transformer architecture became the foundation of modern AI products.

Read chapter → Next →

Chapter 4: AI Safety, RLHF, and Constitutional AI

Why safety is product architecture — not just policy — and how RLHF and Constitutional AI shape behavior.

Read chapter →

← Chapter 02 Chapter 04 → Back to Module Back to Blog AI Learning