Introduction
In Chapter 8, we covered hallucination in depth — why fluent language is not proof, how grounding and verification reduce risk, and what product controls belong in the architecture.
In Chapter 9, we covered context engineering — why bigger windows do not remove the need for curation, retrieval, and focus.
In Chapter 10, we covered how pre-training, fine-tuning, and RLHF shape what a model knows and how it behaves before your product layer runs.
Now we connect those ideas to a practical PM reality:
Even strong models have hard limits.
Demos look omniscient. Production surfaces knowledge cutoffs, stale training data, missing private facts, weak reasoning on edge cases, arithmetic slips, tool failures, and overconfident tone — often in the same workflow.
This chapter does not repeat Chapter 8’s treatment of hallucination mechanics. It explains how cutoffs and structural limitations interact with hallucination risk, and how PMs design products that assume limits instead of fighting them with bigger prompts.
The simple PM version
Capability → Cutoff → Limits → Architecture.
The model is a component, not the system of record.
Your product owns freshness, authority, verification, and fallbacks.
1. Hallucination Recap — Without Repeating Chapter 8
Hallucination is when the model states something as fact without support in provided context or verified knowledge. Chapter 8 covered causes (training data, missing context, sampling, RAG failures) and mitigations (attribution, verification, refusal, human review).
For this chapter, keep three PM anchors from Chapter 8:
- Plausible ≠ true — fluency is not evidence.
- Ground or verify — high-risk answers need sources or checks.
- Product owns risk tiers — brainstorming and claims adjudication cannot share the same bar.
Cutoffs and limitations increase hallucination pressure: the model fills gaps with patterns from old training data, answers as if the missing private data did not matter, or extrapolates beyond its reasoning depth. The fix is still architectural, not “try harder to be accurate.”
PM takeaway
When you hit a cutoff or capability limit, expect hallucination-shaped failures. Design retrieval, tools, and verification for that tier.
2. Knowledge Cutoffs — What the Model Cannot Know Yet
A knowledge cutoff is the date through which a base model’s pre-training data is current. Events, regulations, product releases, and pricing after that date are not reliably “known” unless you inject them via context, tools, or retrieval.
| User assumption | Reality |
|---|---|
| “It’s the latest GPT — it knows 2026.” | Cutoff + your context determine freshness |
| “It answered confidently.” | Confidence does not imply post-cutoff facts |
| “We’ll mention the date in the prompt.” | Prompts do not add missing world knowledge |
PM questions to ask vendors and engineering:
- What is the documented cutoff for each model tier we ship?
- Which features use browsing, RAG, or tools to work around the cutoff for fresh facts?
- What UX copy sets user expectations about timeliness?
PM takeaway
Publish cutoff-aware behavior in the PRD: when the product must retrieve, call APIs, or refuse rather than guess.
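The retrieve-or-refuse rule above can be sketched as a small routing function. This is a minimal illustration, not a prescribed design: `Route`, `route_for_freshness`, and the dates are hypothetical, and it assumes the product can estimate the date a question's facts must be current as of.

```python
from datetime import date
from enum import Enum


class Route(Enum):
    ANSWER_FROM_MODEL = "answer_from_model"
    RETRIEVE = "retrieve"
    REFUSE = "refuse"


def route_for_freshness(needed_as_of: date, model_cutoff: date,
                        retrieval_available: bool) -> Route:
    """Choose a path based on whether the needed fact postdates the cutoff."""
    if needed_as_of <= model_cutoff:
        # Weights may cover this period; stale-knowledge checks still apply.
        return Route.ANSWER_FROM_MODEL
    # The fact is newer than the weights can reliably cover: fetch it or decline.
    return Route.RETRIEVE if retrieval_available else Route.REFUSE


# Hypothetical dates for illustration only.
print(route_for_freshness(date(2026, 1, 15), date(2025, 6, 1), retrieval_available=False))
```

Writing the rule down this way makes the refusal branch an explicit requirement rather than something left to model behavior.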
3. Private Knowledge — What Was Never in Training
Models were not trained on your member database, unreleased roadmap, internal incident postmortems, or customer-specific contract amendments. Private knowledge must enter via context, RAG, tools, or fine-tuning on permitted data — each with tradeoffs from Chapters 6 and 9.
Failure mode: the model invents a plausible internal policy ID, discount rule, or account status because the question sounds like something it should know.
| Data type | Typical product path |
|---|---|
| Static internal handbook | RAG + versioned index |
| Per-user state | Tool call to system of record |
| Rare phrasing / labels | Fine-tune or few-shot on approved examples |
| Secrets | Never in prompts; redact in logs |
PM takeaway
If the answer depends on data the model never saw, the PRD must name the injection mechanism — not assume “the AI will know.”
4. Stale Knowledge — Right Once, Wrong Now
Even before the hard cutoff, training data can be stale: old formularies, superseded APIs, deprecated SKUs, prior regulatory guidance. Stale knowledge is especially dangerous because answers sound authoritative and may have been true in training.
Stale failures show up when:
- RAG indexes lag behind policy publishes,
- cached embeddings serve old chunks,
- fine-tunes encode last year’s process,
- users upload an outdated PDF beside the current policy.
Product mitigations: version tags on sources, “effective date” in UI, scheduled re-index, tool fetch for authoritative records, and explicit stale-source warnings in support tooling.
PM takeaway
Freshness is a product SLA — index lag and cache TTL belong in requirements, not only in ops runbooks.
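One way to make index lag testable is a check like the one below. It is a sketch under stated assumptions: the function name, the timestamps, and the four-hour threshold are all hypothetical, and it assumes the pipeline records when a policy was last published and when the index was last rebuilt.

```python
from datetime import datetime, timedelta, timezone


def freshness_sla_breached(last_published_at: datetime, last_indexed_at: datetime,
                           now: datetime, max_lag: timedelta) -> bool:
    """True when a publish has gone unindexed for longer than max_lag."""
    if last_indexed_at >= last_published_at:
        return False  # the index already reflects the latest publish
    return (now - last_published_at) > max_lag


# Hypothetical timestamps: policy published at 09:00, index last built the night before.
published = datetime(2025, 3, 10, 9, 0, tzinfo=timezone.utc)
indexed = datetime(2025, 3, 9, 23, 0, tzinfo=timezone.utc)
now = datetime(2025, 3, 10, 15, 0, tzinfo=timezone.utc)
print(freshness_sla_breached(published, indexed, now, max_lag=timedelta(hours=4)))
```

A check of this shape can feed both alerting and a stale-source warning in the UI.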
5. The Model Is Not the System of Record
The system of record (SOR) is where truth lives for a domain: eligibility engine, billing ledger, CRM, policy admin, content CMS. The LLM is a reasoning and language layer over that truth — not a database.
If a decision affects money, access, compliance, or safety, the SOR (or a verified read of it) must win.
| Layer | Role |
|---|---|
| LLM | Draft, explain, route, summarize |
| Tools / APIs | Read and write authoritative state |
| Human review | Approve high-impact actions |
| Audit log | Replay inputs, sources, and decisions |
PM takeaway
Write acceptance criteria: “assistant must not finalize X without SOR confirmation.”
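That acceptance criterion can be expressed as a gate in code. The sketch below is illustrative only: `SorRead`, `finalize`, and the status strings are hypothetical names, and it assumes the caller passes `None` when no verified read of the system of record is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SorRead:
    record_id: str
    status: str  # value returned by the system of record, e.g. "eligible"


def finalize(model_draft_status: str, sor_read: Optional[SorRead]) -> dict:
    """Return the status to act on; the model draft is never final by itself."""
    if sor_read is None:
        return {"status": "pending_sor_confirmation", "source": "none"}
    # On disagreement the system of record wins; the flag is kept for the audit log.
    return {
        "status": sor_read.status,
        "source": f"sor:{sor_read.record_id}",
        "model_agreed": model_draft_status == sor_read.status,
    }


print(finalize("eligible", SorRead(record_id="m-001", status="ineligible")))
```

Logging `model_agreed` gives a cheap signal for how often the language layer drifts from authoritative state.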
6. Context Dependency — Answers Only as Good as What You Send
Models do not have persistent memory of your org unless you build it. Every answer is conditioned on the current prompt, system instructions, retrieved chunks, tool outputs, and chat history in window — per Chapter 9.
Context dependency means:
- omit the clause → wrong conclusion,
- duplicate conflicting docs → arbitrary preference,
- bury the signal in noise → context rot,
- truncate history → forgotten constraints.
Limitation is not only “model IQ” — it is “what did we put in front of it this turn?”
PM takeaway
Observability must show context replay. If support cannot see retrieved chunks and tool JSON, you cannot explain wrong answers.
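A context replay record might look like the following. The field names are assumptions chosen for illustration, not a standard schema, and real payloads would need PII redaction before they are logged.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ContextReplayRecord:
    request_id: str
    model_version: str
    prompt_version: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    tool_outputs: list[dict[str, Any]] = field(default_factory=list)


def to_log_line(record: ContextReplayRecord) -> str:
    """Serialize one turn's context as a single JSON line for later replay."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical values for illustration only.
record = ContextReplayRecord(
    request_id="req-123",
    model_version="model-tier-a",
    prompt_version="policy-qa-v7",
    retrieved_chunk_ids=["clause-4.2"],
    tool_outputs=[{"tool": "get_benefits", "ok": True}],
)
print(to_log_line(record))
```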
7. Reasoning Limits — Depth, Planning, and Edge Cases
LLMs excel at language-heavy tasks with patterns in training data. They are weaker on multi-step planning under tight constraints, novel combinatorics, and tasks requiring exhaustive case analysis without tools.
Reasoning limits appear as:
- skipping a branch in a decision tree,
- merging incompatible rules,
- losing track of constraints in long chains,
- “sounds right” logic that fails formal checks.
Mitigations: decompose into steps, structured outputs, external validators, calculators, rules engines for hard gates, human review for exceptions.
PM takeaway
Match workflow depth to model + tools. Do not ship “full autonomous adjudication” on language reasoning alone.
8. Arithmetic and Structured Computation
Models can do mental math on small numbers in demos and still fail on totals, tiered pricing, pro-rations, or ledger reconciliation. Arithmetic and tabular math should run in code, spreadsheet engines, or dedicated tools, with the model formatting inputs and explaining outputs.
| Task | Prefer |
|---|---|
| Sum of line items | Calculator tool / backend |
| Benefit accumulator | Eligibility API |
| Date math across time zones | Date library in code |
| Explain result to user | LLM narrative on verified numbers |
PM takeaway
Never accept “the model calculated it” for money-moving fields without a deterministic check.
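A deterministic check can be as small as recomputing the total with Python's `decimal` module and comparing it to the figure the model reported. The function name and the amounts below are hypothetical; the sketch assumes amounts arrive as decimal strings rather than floats.

```python
from decimal import Decimal


def check_total(line_items: list[str], model_reported_total: str) -> tuple[Decimal, bool]:
    """Recompute the total in code and report whether the model's figure matches."""
    computed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return computed, computed == Decimal(model_reported_total)


# Hypothetical line items and a model-reported total that is off by ten cents.
computed, matches = check_total(["19.99", "5.01", "100.00"], model_reported_total="125.10")
if not matches:
    print(f"Mismatch: using computed total {computed}")
```

When the figures disagree, the computed value is the one that should reach the user, with the model limited to explaining it.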
9. Instruction Limits — Following Rules Under Pressure
Instruction-tuned models follow system prompts well — until context is huge, tools return noisy payloads, users jailbreak, or conflicting rules appear in retrieved text. Instruction limits mean priority and consistency are not guaranteed across long agent runs.
Strengthen instructions with:
- short, ordered system rules (not a 40-page policy dump),
- structured output schemas for machine-checkable fields,
- hard gates in code (“if missing auth doc → status = pending”),
- separation of “style” vs “must never” rules.
PM takeaway
Critical constraints belong in code or workflow engine — not only in prose the model might dilute.
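The “if missing auth doc → status = pending” gate from the list above could be sketched like this. The document names and status strings are hypothetical; the point is only that the rule runs after the model and cannot be overridden by it.

```python
REQUIRED_DOCS = frozenset({"auth_form", "clinical_notes"})  # hypothetical document names


def gated_status(model_suggested_status: str, docs_present: set[str]) -> str:
    """Enforce the hard rule in code, regardless of what the model suggested."""
    if REQUIRED_DOCS - docs_present:
        return "pending"
    return model_suggested_status


print(gated_status("approved", {"auth_form"}))
```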
10. Prompt Sensitivity — Small Wording, Different Outcomes
Outputs shift with phrasing, example order, language, and implicit tone. Prompt sensitivity makes “works on my prompt” fragile across locales, accessibility needs, and messy real user text.
Product implications:
- eval suites with paraphrases and typos,
- canonical user intents mapped to stable internal prompts,
- guardrails for empty, hostile, or ambiguous inputs,
- documented prompt versions tied to model version.
PM takeaway
Treat prompts like config: versioned, reviewed, and tested — not one engineer’s notebook.
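A paraphrase eval can start as a consistency score over reworded inputs. In this sketch `classify` stands in for whatever component maps user text to an intent; the keyword stub, intent name, and example phrasings are all hypothetical.

```python
from typing import Callable, Iterable


def paraphrase_consistency(classify: Callable[[str], str],
                           paraphrases: Iterable[str],
                           expected_intent: str) -> float:
    """Fraction of paraphrases that map to the expected intent."""
    results = [classify(text) == expected_intent for text in paraphrases]
    if not results:
        raise ValueError("paraphrases must not be empty")
    return sum(results) / len(results)


def keyword_stub(text: str) -> str:
    # Stand-in for a real model call; deliberately brittle to show the failure mode.
    return "check_deductible" if "deductible" in text.lower() else "unknown"


score = paraphrase_consistency(
    keyword_stub,
    ["What is my deductible?", "DEDUCTIBLE for plan x pls", "how much do i pay first"],
    expected_intent="check_deductible",
)
print(f"consistency: {score:.2f}")
```

Tracking this score per prompt version and model version turns “works on my prompt” into a regression metric.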
11. Nondeterminism — Same Input, Different Paths
Sampling (Chapter 7), tool latency, retrieval ranking, and parallel agents introduce variance. Nondeterminism is a feature for brainstorming and a defect for regulated extraction unless bounded.
| Use case | Sampling stance |
|---|---|
| Marketing drafts | Higher temperature OK with human edit |
| JSON extraction | Low temperature + schema validation |
| Policy interpretation | Low temperature + cite sources |
| Agent planning | Deterministic tools + logged branches |
PM takeaway
Define per-feature temperature and retry policy in the PRD — not one global default.
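Per-feature sampling can live in a small, reviewable config rather than a global default. The feature names, temperatures, and retry counts below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPolicy:
    temperature: float
    max_retries: int
    require_schema_validation: bool


# Hypothetical per-feature settings mirroring the table above.
POLICIES = {
    "marketing_draft": SamplingPolicy(0.9, 0, False),
    "json_extraction": SamplingPolicy(0.0, 2, True),
    "policy_interpretation": SamplingPolicy(0.1, 1, True),
}


def policy_for(feature: str) -> SamplingPolicy:
    """Fail closed: a feature without a declared policy cannot call the model."""
    try:
        return POLICIES[feature]
    except KeyError:
        raise ValueError(f"No sampling policy defined for feature {feature!r}") from None


print(policy_for("json_extraction"))
```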
12. Overconfidence — Tone Without Calibration
RLHF-trained assistants are helpful and fluent — which users read as certainty. Overconfidence is a UX and safety problem: users skip verification when the voice sounds authoritative.
Product patterns that help:
- explicit uncertainty language when evidence is thin,
- confidence tied to source coverage (“2 of 5 required docs found”),
- “verify before send” for customer-facing outputs,
- refusal paths that explain what is missing — per Chapter 8.
PM takeaway
Design tone for trust calibration, not impressiveness. A hedged correct answer beats a confident wrong one.
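Confidence tied to source coverage can be computed rather than generated. A minimal sketch, assuming the product knows which documents a given answer requires; the function name and label wording are hypothetical.

```python
def coverage_label(required: set[str], found: set[str]) -> str:
    """Build a user-facing evidence label from document coverage, not model tone."""
    hits = len(required & found)
    total = len(required)
    label = f"{hits} of {total} required docs found"
    if hits < total:
        label += "; verify before relying on this answer"
    return label


print(coverage_label({"a", "b", "c", "d", "e"}, {"a", "b", "x"}))
```

Because the label is derived from retrieval results, it stays honest even when the generated answer sounds certain.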
13. Multimodal Limits — Vision, Audio, and Documents
Multimodal models can read images, slides, and scans — with limits on resolution, handwriting, charts, and rare layouts. Multimodal failures look like misread table cells, wrong legend colors, or skipped fine print in PDFs.
Mitigations: OCR pipeline for critical fields, human review on low-confidence parses, separate extraction agent with structured schema, never trust vision-only for dosage or legal amounts.
PM takeaway
Specify document types in scope. Out-of-scope formats should fail gracefully with human routing.
14. Tool Limits — APIs Break, Agents Loop
Tools extend the model past cutoffs, but only when they work. Tool limits include timeouts, partial JSON, rate limits, ambiguous schemas, and agent loops that burn tokens without progress.
Product controls:
- cap tool calls per turn and per session,
- idempotent writes with confirmation steps,
- fallback UX when SOR is down,
- redacted logging for PII in tool payloads.
PM takeaway
Tooling is part of reliability SLAs. “Model was fine, API failed” is still your incident.
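Capping tool calls per turn and per session can be enforced with a small counter that the agent loop consults before every call. The class names and limits here are hypothetical, and a real agent framework may already provide an equivalent setting.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when an agent tries to exceed its tool-call allowance."""


class ToolBudget:
    def __init__(self, per_turn: int, per_session: int) -> None:
        self.per_turn = per_turn
        self.per_session = per_session
        self.turn_used = 0
        self.session_used = 0

    def start_turn(self) -> None:
        """Reset the per-turn counter; the session counter keeps accumulating."""
        self.turn_used = 0

    def charge(self) -> None:
        """Call once before each tool invocation."""
        if self.turn_used >= self.per_turn:
            raise ToolBudgetExceeded("per-turn tool limit reached")
        if self.session_used >= self.per_session:
            raise ToolBudgetExceeded("per-session tool limit reached")
        self.turn_used += 1
        self.session_used += 1


budget = ToolBudget(per_turn=2, per_session=3)
budget.charge()
budget.charge()
try:
    budget.charge()
except ToolBudgetExceeded as exc:
    print(f"stopping agent loop: {exc}")
```

Catching the exception is where the fallback UX belongs: stop the loop, tell the user what is missing, and route to a human if needed.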
15. Bias and Representation Limits
Training data and feedback loops encode historical bias, uneven coverage of languages and dialects, and skewed portrayals of groups. Bias affects ranking suggestions, hiring support copy, clinical triage language, and moderation — even with safety tuning.
PM responsibilities: harm tiers, prohibited use cases, eval slices by demographic proxy where appropriate, human review for sensitive domains, transparency on limitations in user-facing docs.
PM takeaway
Bias is not only an ethics ticket — it is a product quality and legal exposure issue with measurable evals.
16. Domain Limits — Expertise Boundaries
General models are broad and shallow in regulated domains unless you add retrieval, fine-tunes, and experts in the loop. Domain limits show up in medicine, law, tax, insurance, and engineering standards where wrong advice has asymmetric harm.
| Domain | Typical product stance |
|---|---|
| Clinical | Decision support only; no diagnosis claims |
| Legal | Research aid; cite sources; no filing automation without review |
| Insurance | Policy-grounded; SOR for eligibility |
| Developer tools | Suggest code; run tests in CI |
PM takeaway
Scope the product claim: “assists with X under Y constraints” — not “expert in everything.”
17. Evaluation Limits — Benchmarks ≠ Your Workflow
Leaderboard scores rarely predict behavior on your documents, tools, and user phrasing. Eval limits mean you need task-specific golden sets, production shadow metrics, and regression testing whenever models or prompts change.
Minimum eval dimensions for limitation-aware products:
- freshness (post-cutoff questions),
- private-knowledge cases (should retrieve or refuse),
- adversarial missing context,
- arithmetic and structured fields,
- tool failure injection.
PM takeaway
Ship criteria: “no release without a passing regression run on the limitation suite for tier-1 flows.”
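A release gate over the limitation suite can be a simple check in CI. In this sketch the case names are hypothetical, and a case that was never run counts as blocking, on the assumption that missing coverage should not pass silently.

```python
def blocking_cases(results: dict[str, bool], tier1_cases: set[str]) -> list[str]:
    """Return tier-1 cases that failed or were never run; any entry blocks release."""
    return sorted(case for case in tier1_cases if not results.get(case, False))


# Hypothetical suite results: one pass, one failure, one case that did not run.
blockers = blocking_cases(
    {"freshness": True, "arithmetic": False},
    tier1_cases={"freshness", "arithmetic", "tool_failure"},
)
print(blockers or "release gate passed")
```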
18. The Limitations Stack — How Failures Compound
Production failures are rarely one limitation. They stack:
- stale RAG chunk (stale knowledge),
- missing eligibility tool call (not SOR),
- long chat history noise (context dependency),
- confident summary (overconfidence),
- user trusts and acts (product risk).
| Layer | What it constrains |
|---|---|
| Training cutoff | World facts after date |
| Private data path | Org-specific truth |
| Context design | What the model sees this turn |
| Reasoning + tools | Correct multi-step outcomes |
| UX + policy | What users are allowed to do with output |
Capability → Cutoff → Limits → Architecture is how you unstack failures into owned layers.
19. Product Controls — What to Build Around Limits
Across Chapters 8–10, controls repeat. Here is the limitation-focused set:
| Control | Addresses |
|---|---|
| RAG with versioned index | Cutoff, stale, private docs |
| Tool calls to SOR | Private knowledge, arithmetic |
| Structured outputs + validation | Reasoning, instruction drift |
| Human review gates | Domain risk, overconfidence |
| Refusal + missing-info UX | Hallucination under uncertainty |
| Context replay telemetry | Context dependency, eval |
| Risk-tiered sampling | Nondeterminism |
PM takeaway
Pick controls per tier in the PRD — not “enable RAG” as a checkbox without freshness, auth, and attribution rules.
20. What PMs Should Not Do
| Mistake | Why it fails |
|---|---|
| Assume latest model knows current events | Cutoff + missing retrieval |
| Use the LLM as billing or eligibility SOR | Not auditable; arithmetic drift |
| Skip freshness SLAs on RAG | Stale knowledge failures |
| One eval set from public benchmarks only | Misses domain and tool paths |
| Hide limitations in marketing copy | Trust collapse on first wrong answer |
| Rely on “be careful” in system prompt | Instruction limits under load |
| Treat hallucination chapter as enough | Cutoffs need architecture too |
21. Decision Framework — Five Questions per Feature
- What facts must be fresh? Route those through tools, RAG, or browsing; never rely on weights alone.
- What facts are private? — Name injection path and access control.
- What is authoritative? — SOR wins; model drafts only.
- What failure is acceptable? — Sets review, refusal, and sampling.
- How will we detect regression? — Limitation eval suite + context replay.
| If the answer needs… | Lean toward… |
|---|---|
| Today’s public news | Browse or news API + cite |
| Internal policy clause | Versioned RAG + effective date |
| Member-specific status | SOR tool call |
| Exact dollars | Calculator / billing engine |
| Creative marketing | Higher sampling temperature + human edit |
22. Practical Example: Policy Q&A Assistant
Weak design: Single chat over base model; users ask “what’s our 2026 deductible policy for Plan X?”
Failure modes: a guess from pre-cutoff training data, a stale PDF in the upload folder, no citation, and a confident but wrong dollar amount.
Better design:
- RAG over a versioned policy corpus with effective dates in metadata,
- refusal when no chunk above score threshold,
- answers show clause ID + link to source PDF,
- human-reviewed diffs when policy team publishes updates,
- wording in UI: “Answers use documents indexed as of [date].”
PM takeaway
Policy Q&A is a freshness and attribution product — not a general knowledge chatbot.
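The refusal and citation rules in the better design can be sketched as a selection step between retrieval and generation. `Chunk`, `select_or_refuse`, the score scale, and the 0.75 threshold are assumptions for illustration; real retrievers score differently and thresholds need tuning against a golden set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    clause_id: str
    score: float
    effective_date: str
    text: str


def select_or_refuse(chunks: list[Chunk], min_score: float) -> Optional[Chunk]:
    """Return the best chunk at or above the threshold, or None to trigger a refusal."""
    eligible = [chunk for chunk in chunks if chunk.score >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda chunk: chunk.score)


# Hypothetical retrieval results.
hits = [
    Chunk("clause-4.2", 0.81, "2026-01-01", "Plan X deductible text"),
    Chunk("clause-4.1", 0.42, "2025-01-01", "Superseded deductible text"),
]
best = select_or_refuse(hits, min_score=0.75)
if best is None:
    print("No supporting clause found; cannot answer from indexed policy.")
else:
    print(f"Cite {best.clause_id} (effective {best.effective_date})")
```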
23. Practical Example: Prior Authorization Assistant
Weak design: Model reads free-text notes and decides approve/deny from memory of “similar cases.”
Better design:
- tool fetch: member benefits, auth history, formulary from SOR,
- RAG: medical necessity criteria for the requested procedure,
- structured checklist output (missing docs, criteria met/not met/unclear),
- arithmetic only via benefits engine — not LLM,
- human nurse/doctor review before denial letters are sent,
- audit log with context replay for appeals.
PM takeaway
Prior auth is a limitations stack in one workflow — cutoff, private data, reasoning, tools, and domain risk together.
24. Practical Example: Marketing Content Workflow
Content teams want speed. Limits still apply: brand guidelines (private), factual claims about product (cutoff/stale), regulated claims (domain).
Weak design: One-shot “write launch email for feature Y” with no grounding.
Better design:
- RAG over approved messaging kit and legal footnotes,
- tool pull for feature availability dates from product CMS,
- higher sampling temperature for variants, with a human editor,
- blocked terms list enforced in post-processing,
- separate tier: external stats require cited source snippet in draft.
Here, hallucination risk is reputational and regulatory — controls are lighter than prior auth but still explicit.
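The blocked terms list is one of the easier controls to enforce in post-processing. This sketch uses Python's `re` module with whole-word, case-insensitive matching; the term list is hypothetical, and multi-word or punctuation-heavy terms may need a different matching rule.

```python
import re


def find_blocked_terms(draft: str, blocked: list[str]) -> list[str]:
    """Return the blocked terms that appear in the draft as whole words."""
    hits = []
    for term in blocked:
        if re.search(rf"\b{re.escape(term)}\b", draft, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Hypothetical draft and term list.
found = find_blocked_terms("Guaranteed results, the best cure!", ["guaranteed", "cure", "risk-free"])
if found:
    print(f"Draft blocked; remove: {', '.join(found)}")
```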
25. The PM Mental Model
Impressive capability does not remove cutoff dates, private data gaps, or reasoning boundaries.
Capability → Cutoff → Limits → Architecture.
Your product supplies what the model cannot hold alone.
| Stage | PM owns |
|---|---|
| Capability | Choose model tier and modality for the task |
| Cutoff | Freshness paths for time-sensitive facts |
| Limits | Honest scope, evals, and failure UX |
| Architecture | RAG, tools, SOR, review, telemetry |
Do not ship “the model will handle it.” Ship the stack that makes limits visible and recoverable.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Hallucination (Ch8) | Plausible language without evidence — controls still apply here |
| Knowledge cutoff | Training ends at a date; fresh facts need injection |
| Private knowledge | Never trained; needs RAG, tools, or approved fine-tune |
| Stale knowledge | Was true once; index and cache SLAs matter |
| Not SOR | Authoritative systems win for decisions |
| Context dependency | Answers follow what you send this turn |
| Reasoning & arithmetic limits | Decompose; validate; use code and APIs |
| Overconfidence | Calibrate UX and require verification |
| Limitations stack | Failures compound — design layers |
| PM role | Architecture around cutoffs, not bigger prompts alone |
Closing Thought
Users experience limitations as betrayal: “I thought it knew our policy,” “it denied auth wrong,” “the numbers didn’t match the bill.” Those moments are predictable when products treat models like omniscient teammates.
Strong PMs name limits in requirements, instrument context, tie decisions to systems of record, and align tone with evidence — building on hallucination controls from Chapter 8 without pretending one chapter solves production risk.
Module 02 opens with the model landscape — how to compare GPT, Claude, Gemini, Llama, and others on capability, cost, context, multimodality, and deployment fit for your use case.
The real PM lesson
Design for what the model cannot do alone — then choose the model that fits the stack.