Chapter 11 · Module 01 · Beginner–Intermediate · 26–30 min

Chapter 11: Hallucinations, Knowledge Cutoffs, and Model Limitations — The PM Version

Why strong models still fail — and how PMs design around cutoffs and limits.


Impressive demos are not production guarantees — design the stack around what models cannot do alone

Introduction

In Chapter 8, we covered hallucination in depth — why fluent language is not proof, how grounding and verification reduce risk, and what product controls belong in the architecture.

In Chapter 9, we covered context engineering — why bigger windows do not remove the need for curation, retrieval, and focus.

In Chapter 10, we covered how pre-training, fine-tuning, and RLHF shape what a model knows and how it behaves before your product layer runs.

Now we connect those ideas to a practical PM reality:

Even strong models have hard limits.

Demos look omniscient. Production surfaces knowledge cutoffs, stale training data, missing private facts, weak reasoning on edge cases, arithmetic slips, tool failures, and overconfident tone — often in the same workflow.

This chapter does not repeat Chapter 8’s treatment of hallucination mechanics. It explains how cutoffs and structural limitations interact with hallucination risk, and how PMs design products that assume limits instead of fighting them with bigger prompts.

The simple PM version

Capability → Cutoff → Limits → Architecture.
The model is a component, not the system of record.
Your product owns freshness, authority, verification, and fallbacks.

1. Hallucination Recap — Without Repeating Chapter 8

Hallucination is when the model states something as fact without support in provided context or verified knowledge. Chapter 8 covered causes (training data, missing context, sampling, RAG failures) and mitigations (attribution, verification, refusal, human review).

For this chapter, keep three PM anchors from Chapter 8:

  • Plausible ≠ true — fluency is not evidence.
  • Ground or verify — high-risk answers need sources or checks.
  • Product owns risk tiers — brainstorming and claims adjudication cannot share the same bar.

Cutoffs and limitations increase hallucination pressure: the model fills gaps with patterns from old training, ignores missing private data, or extrapolates beyond reasoning depth. The fix is still architectural — not “try harder to be accurate.”

PM takeaway

When you hit a cutoff or capability limit, expect hallucination-shaped failures. Design retrieval, tools, and verification for that tier.

2. Knowledge Cutoffs — What the Model Cannot Know Yet

A knowledge cutoff is the date through which a base model’s pre-training data is current. Events, regulations, product releases, and pricing after that date are not reliably “known” unless you inject them via context, tools, or retrieval.

User assumption | Reality
“It’s the latest GPT, so it knows 2026.” | Cutoff + your context determine freshness
“It answered confidently.” | Confidence does not imply post-cutoff facts
“We’ll mention the date in the prompt.” | Prompts do not add missing world knowledge

PM questions to ask vendors and engineering:

  • What is the documented cutoff for each model tier we ship?
  • Which features use browsing, RAG, or tools to bypass cutoff for fresh facts?
  • What UX copy sets user expectations about timeliness?

PM takeaway

Publish cutoff-aware behavior in the PRD: when the product must retrieve, call APIs, or refuse rather than guess.
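
A minimal sketch of what cutoff-aware behavior could look like as a routing rule. The `MODEL_CUTOFF` date, the `Route` names, and `route_question` are hypothetical and do not come from any vendor API; a real system would also need a way to estimate which date a question depends on.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Route(Enum):
    ANSWER_FROM_MODEL = "answer_from_model"
    RETRIEVE = "retrieve"
    REFUSE = "refuse"


# Hypothetical cutoff; use the vendor's documented date for each model tier.
MODEL_CUTOFF = date(2024, 6, 1)


def route_question(fact_date: Optional[date], retrieval_available: bool) -> Route:
    """Decide how to handle a question given the date of the fact it depends on.

    fact_date is None when the question is not time-sensitive.
    """
    if fact_date is None or fact_date <= MODEL_CUTOFF:
        return Route.ANSWER_FROM_MODEL
    # The fact postdates the cutoff, so the weights cannot be trusted for it.
    return Route.RETRIEVE if retrieval_available else Route.REFUSE
```

Under these assumptions, a question about a 2026 regulation routes to retrieval when a retrieval path exists and to refusal when it does not. Pre-cutoff facts can still be stale, so this rule is a floor, not a full freshness policy.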

3. Private Knowledge — What Was Never in Training

Models were not trained on your member database, unreleased roadmap, internal incident postmortems, or customer-specific contract amendments. Private knowledge must enter via context, RAG, tools, or fine-tuning on permitted data — each with tradeoffs from Chapters 6 and 9.

Failure mode: the model invents a plausible internal policy ID, discount rule, or account status because the question sounds like something it should know.

Data type | Typical product path
Static internal handbook | RAG + versioned index
Per-user state | Tool call to system of record
Rare phrasing / labels | Fine-tune or few-shot on approved examples
Secrets | Never in prompts; redact in logs

PM takeaway

If the answer depends on data the model never saw, the PRD must name the injection mechanism — not assume “the AI will know.”

4. Stale Knowledge — Right Once, Wrong Now

Even before the hard cutoff, training data can be stale: old formularies, superseded APIs, deprecated SKUs, prior regulatory guidance. Stale knowledge is especially dangerous because answers sound authoritative and may have been true in training.

Stale failures show up when:

  • RAG indexes lag behind policy publishes,
  • cached embeddings serve old chunks,
  • fine-tunes encode last year’s process,
  • users upload an outdated PDF beside the current policy.

Product mitigations: version tags on sources, “effective date” in UI, scheduled re-index, tool fetch for authoritative records, and explicit stale-source warnings in support tooling.

PM takeaway

Freshness is a product SLA — index lag and cache TTL belong in requirements, not only in ops runbooks.
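
One way to make an index-lag requirement testable is to check chunk age at answer time. The sketch below assumes a hypothetical `Chunk` record and a 24-hour `MAX_INDEX_AGE`; both are illustrative, and the right threshold depends on how often your sources change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Chunk:
    text: str
    source_version: str
    indexed_at: datetime


# Hypothetical SLA: chunks indexed more than 24 hours ago count as stale.
MAX_INDEX_AGE = timedelta(hours=24)


def split_by_freshness(chunks, now, max_age=MAX_INDEX_AGE):
    """Return (fresh, stale) lists based on how long ago each chunk was indexed."""
    fresh = [c for c in chunks if now - c.indexed_at <= max_age]
    stale = [c for c in chunks if now - c.indexed_at > max_age]
    return fresh, stale
```

Stale chunks could be dropped, re-fetched, or shown with a warning; which of those applies is a product decision per risk tier.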

5. The Model Is Not the System of Record

The system of record (SOR) is where truth lives for a domain: eligibility engine, billing ledger, CRM, policy admin, content CMS. The LLM is a reasoning and language layer over that truth — not a database.

If a decision affects money, access, compliance, or safety, the SOR (or a verified read of it) must win.

Layer | Role
LLM | Draft, explain, route, summarize
Tools / APIs | Read and write authoritative state
Human review | Approve high-impact actions
Audit log | Replay inputs, sources, and decisions

PM takeaway

Write acceptance criteria: “assistant must not finalize X without SOR confirmation.”
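
That acceptance criterion can be expressed as a small gate in code. This is a sketch under assumptions: `Decision` and `finalize_eligibility` are hypothetical names, and the SOR read is represented by a plain value that is `None` when the lookup failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    status: str            # "finalized" or "pending_review"
    value: Optional[str]
    reason: str


def finalize_eligibility(model_draft: str, sor_value: Optional[str]) -> Decision:
    """Finalize only what the system of record confirms.

    sor_value is None when the SOR read failed or returned nothing.
    """
    if sor_value is None:
        return Decision("pending_review", None, "no SOR confirmation")
    if model_draft != sor_value:
        # The SOR wins; the disagreement is kept for review and evals.
        return Decision("finalized", sor_value, "model draft overridden by SOR")
    return Decision("finalized", sor_value, "model draft matches SOR")
```

The model's draft never becomes the final value on its own: it is either confirmed, overridden, or parked for review.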

6. Context Dependency — Answers Only as Good as What You Send

Models do not have persistent memory of your org unless you build it. Every answer is conditioned on the current prompt, system instructions, retrieved chunks, tool outputs, and chat history in window — per Chapter 9.

Context dependency means:

  • omit the clause → wrong conclusion,
  • duplicate conflicting docs → arbitrary preference,
  • bury the signal in noise → context rot,
  • truncate history → forgotten constraints.

Limitation is not only “model IQ” — it is “what did we put in front of it this turn?”

PM takeaway

Observability must show context replay. If support cannot see retrieved chunks and tool JSON, you cannot explain wrong answers.
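
A context replay record can be as simple as one structured log line per turn. The field names below (`prompt_version`, `retrieved_chunk_ids`, `tool_outputs`) are assumptions for illustration; real payloads would need PII redaction before logging.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TurnRecord:
    prompt_version: str
    user_message: str
    retrieved_chunk_ids: List[str] = field(default_factory=list)
    tool_outputs: List[Dict[str, Any]] = field(default_factory=list)
    model_answer: str = ""


def to_log_line(record: TurnRecord) -> str:
    """Serialize one turn so support can replay what the model saw."""
    return json.dumps(asdict(record), sort_keys=True)
```

With a record like this, a wrong answer can be traced to a missing chunk, a bad tool payload, or the prompt version in use.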

7. Reasoning Limits — Depth, Planning, and Edge Cases

LLMs excel at language-heavy tasks with patterns in training data. They are weaker on multi-step planning under tight constraints, novel combinatorics, and tasks requiring exhaustive case analysis without tools.

Reasoning limits appear as:

  • skipping a branch in a decision tree,
  • merging incompatible rules,
  • losing track of constraints in long chains,
  • “sounds right” logic that fails formal checks.

Mitigations: decompose into steps, structured outputs, external validators, calculators, rules engines for hard gates, human review for exceptions.

PM takeaway

Match workflow depth to model + tools. Do not ship “full autonomous adjudication” on language reasoning alone.

8. Arithmetic and Structured Computation

Models can do mental math on small numbers in demos and still fail on totals, tiered pricing, pro-rations, or ledger reconciliation. Arithmetic and tabular math should run in code, spreadsheet engines, or dedicated tools, with the model formatting inputs and explaining outputs.

Task | Prefer
Sum of line items | Calculator tool / backend
Benefit accumulator | Eligibility API
Date math across time zones | Date library in code
Explain result to user | LLM narrative on verified numbers

PM takeaway

Never accept “the model calculated it” for money-moving fields without a deterministic check.
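
A deterministic check for a total might look like the sketch below. `verified_total` is a hypothetical helper; it uses Python's `decimal` module so that currency amounts are compared exactly rather than as floats.

```python
from decimal import Decimal


def verified_total(line_items, model_total):
    """Recompute the total in code and compare it with the model's figure.

    Amounts are passed as strings so Decimal parses them exactly.
    Returns (authoritative_total, model_matches).
    """
    total = sum((Decimal(x) for x in line_items), Decimal("0"))
    return total, total == Decimal(model_total)
```

The computed total is the value that ships; the model's number is only compared against it, and a mismatch is a signal to log and investigate.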

9. Instruction Limits — Following Rules Under Pressure

Instruction-tuned models follow system prompts well — until context is huge, tools return noisy payloads, users jailbreak, or conflicting rules appear in retrieved text. Instruction limits mean priority and consistency are not guaranteed across long agent runs.

Strengthen instructions with:

  • short, ordered system rules (not a 40-page policy dump),
  • structured output schemas for machine-checkable fields,
  • hard gates in code (“if missing auth doc → status = pending”),
  • separation of “style” vs “must never” rules.

PM takeaway

Critical constraints belong in code or workflow engine — not only in prose the model might dilute.
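
The hard-gate bullet above ("if missing auth doc → status = pending") could be enforced after the model responds, along these lines. `REQUIRED_DOCS` and `gate_status` are hypothetical names used only for illustration.

```python
# Hypothetical document types; the real list comes from the workflow owner.
REQUIRED_DOCS = {"auth_form", "clinical_notes"}


def gate_status(model_status: str, docs_present: set) -> str:
    """Apply the hard gate after the model responds, regardless of what it said."""
    missing = REQUIRED_DOCS - docs_present
    if missing:
        return "pending"
    return model_status
```

Because the gate runs in code, a diluted or jailbroken prompt cannot turn a case with missing documents into an approval.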

10. Prompt Sensitivity — Small Wording, Different Outcomes

Outputs shift with phrasing, example order, language, and implicit tone. Prompt sensitivity makes “works on my prompt” fragile across locales, accessibility needs, and messy real user text.

Product implications:

  • eval suites with paraphrases and typos,
  • canonical user intents mapped to stable internal prompts,
  • guardrails for empty, hostile, or ambiguous inputs,
  • documented prompt versions tied to model version.

PM takeaway

Treat prompts like config: versioned, reviewed, and tested — not one engineer’s notebook.
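
As an illustration of prompts as versioned config, the sketch below pins each prompt to a version and to the model it was evaluated against. The registry layout, the prompt ID, and the model name are all placeholders, not a real model or a specific tool's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str
    version: str
    model: str       # model version this prompt was evaluated against
    template: str


# Hypothetical registry entry; the model name is a placeholder.
REGISTRY = {
    ("summarize_policy", "1.2.0"): PromptVersion(
        prompt_id="summarize_policy",
        version="1.2.0",
        model="example-model-2025-01",
        template="Summarize the policy below in plain language.\n\n{policy_text}",
    ),
}


def render(prompt_id: str, version: str, **fields: str) -> str:
    """Render a pinned prompt version; unknown versions raise KeyError."""
    return REGISTRY[(prompt_id, version)].template.format(**fields)
```

Callers ask for an explicit version, so a prompt change is a reviewed diff rather than a silent edit.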

11. Nondeterminism — Same Input, Different Paths

Sampling (Chapter 7), tool latency, retrieval ranking, and parallel agents introduce variance. Nondeterminism is a feature for brainstorming and a defect for regulated extraction unless bounded.

Use case | Sampling stance
Marketing drafts | Higher temperature OK with human edit
JSON extraction | Low temperature + schema validation
Policy interpretation | Low temperature + cite sources
Agent planning | Deterministic tools + logged branches

PM takeaway

Define per-feature temperature and retry policy in the PRD — not one global default.
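
A per-feature policy can live in a small lookup table instead of a global default. The feature names and numeric values below are assumptions for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPolicy:
    temperature: float
    max_retries: int
    validate_schema: bool


# Hypothetical values; tune them against your own evals.
POLICIES = {
    "marketing_draft": SamplingPolicy(temperature=0.9, max_retries=0, validate_schema=False),
    "json_extraction": SamplingPolicy(temperature=0.0, max_retries=2, validate_schema=True),
    "policy_interpretation": SamplingPolicy(temperature=0.1, max_retries=1, validate_schema=False),
}


def policy_for(feature: str) -> SamplingPolicy:
    """Look up the policy for a feature; unknown features fail loudly."""
    try:
        return POLICIES[feature]
    except KeyError:
        raise ValueError(f"no sampling policy defined for feature {feature!r}") from None
```

Failing on unknown features is a deliberate choice in this sketch: a new feature cannot ship on an inherited default that nobody reviewed.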

12. Overconfidence — Tone Without Calibration

RLHF-trained assistants are helpful and fluent — which users read as certainty. Overconfidence is a UX and safety problem: users skip verification when the voice sounds authoritative.

Product patterns that help:

  • explicit uncertainty language when evidence is thin,
  • confidence tied to source coverage (“2 of 5 required docs found”),
  • “verify before send” for customer-facing outputs,
  • refusal paths that explain what is missing — per Chapter 8.

PM takeaway

Design tone for trust calibration, not impressiveness. A hedged correct answer beats a confident wrong one.
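
Confidence tied to source coverage can be computed rather than left to the model's tone. In this sketch, `coverage_label` and `should_hedge` are hypothetical helpers and the default ratio of 1.0 (hedge unless every required document was found) is an assumption.

```python
def coverage_label(required_docs, found_docs):
    """Describe evidence coverage, e.g. '2 of 5 required docs found'."""
    found = [d for d in required_docs if d in found_docs]
    return f"{len(found)} of {len(required_docs)} required docs found"


def should_hedge(required_docs, found_docs, min_ratio=1.0):
    """Hedge whenever coverage falls below the required ratio."""
    if not required_docs:
        return False
    found = sum(1 for d in required_docs if d in found_docs)
    return found / len(required_docs) < min_ratio
```

The label can be shown to the user as is, and the hedge flag can switch the response template to explicit uncertainty language.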

13. Multimodal Limits — Vision, Audio, and Documents

Multimodal models can read images, slides, and scans — with limits on resolution, handwriting, charts, and rare layouts. Multimodal failures look like misread table cells, wrong legend colors, or skipped fine print in PDFs.

Mitigations: OCR pipeline for critical fields, human review on low-confidence parses, separate extraction agent with structured schema, never trust vision-only for dosage or legal amounts.

PM takeaway

Specify document types in scope. Out-of-scope formats should fail gracefully with human routing.

14. Tool Limits — APIs Break, Agents Loop

Tools extend the model past cutoffs, but only when they work. Tool limits include timeouts, partial JSON, rate limits, ambiguous schemas, and agent loops that burn tokens without progress.

Product controls that help:

  • cap tool calls per turn and per session,
  • idempotent writes with confirmation steps,
  • fallback UX when SOR is down,
  • redacted logging for PII in tool payloads.

PM takeaway

Tooling is part of reliability SLAs. “Model was fine, API failed” is still your incident.
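
Capping tool calls per turn and per session can be sketched as a small budget object that the agent loop consults before each call. `ToolBudget` and `ToolBudgetExceeded` are hypothetical names, and the caps are whatever the PRD specifies.

```python
class ToolBudgetExceeded(Exception):
    """Raised when an agent tries to exceed its tool-call budget."""


class ToolBudget:
    def __init__(self, max_per_turn: int, max_per_session: int):
        self.max_per_turn = max_per_turn
        self.max_per_session = max_per_session
        self.turn_calls = 0
        self.session_calls = 0

    def start_turn(self) -> None:
        """Reset the per-turn counter; the session counter keeps running."""
        self.turn_calls = 0

    def charge(self) -> None:
        """Call before each tool invocation; raises once a cap is reached."""
        if self.turn_calls >= self.max_per_turn:
            raise ToolBudgetExceeded("per-turn tool cap reached")
        if self.session_calls >= self.max_per_session:
            raise ToolBudgetExceeded("per-session tool cap reached")
        self.turn_calls += 1
        self.session_calls += 1
```

When the exception fires, the product should fall back to a clear message or human routing rather than letting the agent keep looping.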

15. Bias and Representation Limits

Training data and feedback loops encode historical bias, uneven coverage of languages and dialects, and skewed portrayals of groups. Bias affects ranking suggestions, hiring support copy, clinical triage language, and moderation — even with safety tuning.

PM responsibilities: harm tiers, prohibited use cases, eval slices by demographic proxy where appropriate, human review for sensitive domains, transparency on limitations in user-facing docs.

PM takeaway

Bias is not only an ethics ticket — it is a product quality and legal exposure issue with measurable evals.

16. Domain Limits — Expertise Boundaries

General models are broad and shallow in regulated domains unless you add retrieval, fine-tunes, and experts in the loop. Domain limits show up in medicine, law, tax, insurance, and engineering standards where wrong advice has asymmetric harm.

Domain | Typical product stance
Clinical | Decision support only; no diagnosis claims
Legal | Research aid; cite sources; no filing automation without review
Insurance | Policy-grounded; SOR for eligibility
Developer tools | Suggest code; run tests in CI

PM takeaway

Scope the product claim: “assists with X under Y constraints” — not “expert in everything.”

17. Evaluation Limits — Benchmarks ≠ Your Workflow

Leaderboard scores rarely predict behavior on your documents, tools, and user phrasing. Eval limits mean you need task-specific golden sets, production shadow metrics, and regression runs whenever models or prompts change.

Minimum eval dimensions for limitation-aware products:

  • freshness (post-cutoff questions),
  • private-knowledge cases (should retrieve or refuse),
  • adversarial missing context,
  • arithmetic and structured fields,
  • tool failure injection.

PM takeaway

Ship criteria: “no release without a passing regression run on the limitation suite for tier-1 flows.”
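
A limitation suite can start as a list of cases tagged by dimension, scored per dimension. In this sketch `EvalCase` and `run_suite` are hypothetical, and `system` stands in for whatever harness maps a question to an observed behavior label such as "retrieve" or "refuse".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    dimension: str            # e.g. "freshness", "private_knowledge"
    question: str
    expected_behavior: str    # "retrieve", "refuse", or "answer"


def run_suite(cases: List[EvalCase], system: Callable[[str], str]) -> Dict[str, float]:
    """Return the pass rate per dimension for the given system under test."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for case in cases:
        totals[case.dimension] = totals.get(case.dimension, 0) + 1
        if system(case.question) == case.expected_behavior:
            passes[case.dimension] = passes.get(case.dimension, 0) + 1
    return {dim: passes.get(dim, 0) / n for dim, n in totals.items()}
```

Per-dimension pass rates make regressions visible by limitation type, so a release gate can name the dimension that failed.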

18. The Limitations Stack — How Failures Compound

Production failures are rarely one limitation. They stack:

  1. stale RAG chunk (stale knowledge),
  2. missing eligibility tool call (not SOR),
  3. long chat history noise (context dependency),
  4. confident summary (overconfidence),
  5. user trusts and acts (product risk).

Layer | What it constrains
Training cutoff | World facts after date
Private data path | Org-specific truth
Context design | What the model sees this turn
Reasoning + tools | Correct multi-step outcomes
UX + policy | What users are allowed to do with output

Capability → Cutoff → Limits → Architecture is how you unstack failures into owned layers.

19. Product Controls — What to Build Around Limits

Across Chapters 8–10, controls repeat. Here is the limitation-focused set:

Control | Addresses
RAG with versioned index | Cutoff, stale, private docs
Tool calls to SOR | Private knowledge, arithmetic
Structured outputs + validation | Reasoning, instruction drift
Human review gates | Domain risk, overconfidence
Refusal + missing-info UX | Hallucination under uncertainty
Context replay telemetry | Context dependency, eval
Risk-tiered sampling | Nondeterminism

PM takeaway

Pick controls per tier in the PRD — not “enable RAG” as a checkbox without freshness, auth, and attribution rules.

20. What PMs Should Not Do

Mistake | Why it fails
Assume latest model knows current events | Cutoff + missing retrieval
Use the LLM as billing or eligibility SOR | Not auditable; arithmetic drift
Skip freshness SLAs on RAG | Stale knowledge failures
One eval set from public benchmarks only | Misses domain and tool paths
Hide limitations in marketing copy | Trust collapse on first wrong answer
Rely on “be careful” in system prompt | Instruction limits under load
Treat hallucination chapter as enough | Cutoffs need architecture too

21. Decision Framework — Five Questions per Feature

  1. What facts must be fresh? For any that must, use tools, RAG, or browsing; never rely on weights alone.
  2. What facts are private? Name the injection path and access control.
  3. What is authoritative? The SOR wins; the model drafts only.
  4. What failure is acceptable? The answer sets review, refusal, and sampling.
  5. How will we detect regression? Limitation eval suite + context replay.

If the answer needs… | Lean toward…
Today’s public news | Browse or news API + cite
Internal policy clause | Versioned RAG + effective date
Member-specific status | SOR tool call
Exact dollars | Calculator / billing engine
Creative marketing | Higher sampling + human edit
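
The framework can be turned into a checklist function that a feature spec fills in. This is a sketch only: `required_paths` and the path labels are hypothetical names that paraphrase the table above.

```python
def required_paths(needs_fresh_facts, uses_private_data, affects_decision, needs_exact_numbers):
    """Map answers to the framework questions onto the product paths they imply."""
    paths = []
    if needs_fresh_facts:
        paths.append("retrieval_or_browse_with_citation")
    if uses_private_data:
        paths.append("access_controlled_rag_or_tool")
    if affects_decision:
        paths.append("sor_tool_call")
    if needs_exact_numbers:
        paths.append("deterministic_calculator")
    return paths
```

An empty result corresponds to low-risk creative work, where higher sampling with a human editor is usually enough.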

22. Practical Example: Policy Q&A Assistant

Weak design: Single chat over base model; users ask “what’s our 2026 deductible policy for Plan X?”

Failure modes: a guess from pre-cutoff training data, a stale PDF in the upload folder, no citation, a confident but wrong dollar amount.

Better design:

  • RAG over versioned policy corpus with effective dates in metadata,
  • refusal when no chunk above score threshold,
  • answers show clause ID + link to source PDF,
  • human-reviewed diffs when policy team publishes updates,
  • wording in UI: “Answers use documents indexed as of [date].”

PM takeaway

Policy Q&A is a freshness and attribution product — not a general knowledge chatbot.
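
The refusal-on-threshold and clause-citation bullets in the better design could be sketched as follows. `ScoredChunk`, `select_evidence`, and the `MIN_SCORE` value are assumptions; score scales differ between retrievers, so the threshold has to be calibrated on labeled questions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredChunk:
    clause_id: str
    text: str
    score: float


MIN_SCORE = 0.75  # hypothetical threshold; calibrate on a labeled set


def select_evidence(chunks: List[ScoredChunk], min_score: float = MIN_SCORE) -> Optional[ScoredChunk]:
    """Return the best chunk at or above the threshold, or None to trigger refusal."""
    eligible = [c for c in chunks if c.score >= min_score]
    return max(eligible, key=lambda c: c.score) if eligible else None
```

A `None` result routes to the refusal path; otherwise the returned `clause_id` is what the answer cites.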

23. Practical Example: Prior Authorization Assistant

Weak design: Model reads free-text notes and decides approve/deny from memory of “similar cases.”

Better design:

  • tool fetch: member benefits, auth history, formulary from SOR,
  • RAG: medical necessity criteria for the requested procedure,
  • structured checklist output (missing docs, criteria met/not met/unclear),
  • arithmetic only via benefits engine — not LLM,
  • human nurse/doctor review before denial letters are sent,
  • audit log with context replay for appeals.

PM takeaway

Prior auth is a limitations stack in one workflow — cutoff, private data, reasoning, tools, and domain risk together.

24. Practical Example: Marketing Content Workflow

Content teams want speed. Limits still apply: brand guidelines (private), factual claims about product (cutoff/stale), regulated claims (domain).

Weak design: One-shot “write launch email for feature Y” with no grounding.

Better design:

  • RAG over approved messaging kit and legal footnotes,
  • tool pull for feature availability dates from product CMS,
  • higher sampling for variants with human editor,
  • blocked terms list enforced in post-processing,
  • separate tier: external stats require cited source snippet in draft.

Here, hallucination risk is reputational and regulatory — controls are lighter than prior auth but still explicit.
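
Enforcing a blocked terms list in post-processing might look like the sketch below. The terms shown are invented examples; the real list would be owned by legal or brand.

```python
import re

# Hypothetical list for illustration only.
BLOCKED_TERMS = ["guaranteed", "risk-free"]


def find_blocked_terms(draft: str, blocked=BLOCKED_TERMS):
    """Return blocked terms that appear in the draft as whole words, case-insensitively."""
    hits = []
    for term in blocked:
        if re.search(rf"\b{re.escape(term)}\b", draft, flags=re.IGNORECASE):
            hits.append(term)
    return hits
```

A non-empty result can block publishing or send the draft back to the editor with the offending terms highlighted.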

25. The PM Mental Model

Impressive capability does not remove cutoff dates, private data gaps, or reasoning boundaries.
Capability → Cutoff → Limits → Architecture.
Your product supplies what the model cannot hold alone.

Stage | PM owns
Capability | Choose model tier and modality for the task
Cutoff | Freshness paths for time-sensitive facts
Limits | Honest scope, evals, and failure UX
Architecture | RAG, tools, SOR, review, telemetry

Do not ship “the model will handle it.” Ship the stack that makes limits visible and recoverable.

Chapter Summary

Concept | PM understanding
Hallucination (Ch8) | Plausible language without evidence; controls still apply here
Knowledge cutoff | Training ends at a date; fresh facts need injection
Private knowledge | Never trained; needs RAG, tools, or approved fine-tune
Stale knowledge | Was true once; index and cache SLAs matter
Not SOR | Authoritative systems win for decisions
Context dependency | Answers follow what you send this turn
Reasoning & arithmetic limits | Decompose; validate; use code and APIs
Overconfidence | Calibrate UX and require verification
Limitations stack | Failures compound; design layers
PM role | Architecture around cutoffs, not bigger prompts alone

Closing Thought

Users experience limitations as betrayal: “I thought it knew our policy,” “it denied auth wrong,” “the numbers didn’t match the bill.” Those moments are predictable when products treat models like omniscient teammates.

Strong PMs name limits in requirements, instrument context, tie decisions to systems of record, and align tone with evidence — building on hallucination controls from Chapter 8 without pretending one chapter solves production risk.

Module 02 opens with the model landscape — how to compare GPT, Claude, Gemini, Llama, and others on capability, cost, context, multimodality, and deployment fit for your use case.

The real PM lesson

Design for what the model cannot do alone — then choose the model that fits the stack.
