Introduction
In Chapter 5, we walked through InstructGPT and the RLHF pipeline — how human feedback turns a base model into an instruction-following assistant.
In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the product layers you control after a vendor ships a model.
In Chapter 8, we covered hallucination — why fluent text is not proof, and how grounding and verification reduce risk.
In Chapter 9, we covered long context — capacity vs focus, and why bigger windows do not replace context design.
Now we zoom out to the training stack that happens before your product ever sends a prompt:
Pre-training, fine-tuning, and RLHF — what each stage actually changes.
PMs hear these terms in roadmap reviews, vendor decks, and compliance conversations. They are often lumped together as “the model.” In practice they solve different problems: general capability, specialized behavior, and human-preferred behavior.
This chapter is not a machine learning course. It is a map of what you are buying, what you can still change in product, and what mistakes to avoid when you scope AI features.
The simple PM version
Pre-training teaches language and broad knowledge.
Fine-tuning teaches patterns and formats.
RLHF teaches what humans prefer.
Your product still owns prompts, context, tools, and verification.
1. The Simple Comparison: Three Training Layers
Think of a modern assistant as a stack. The training layers are expensive and slow to change, and each layer is owned by a different team.
| Stage | One-line PM meaning | Who usually owns it |
|---|---|---|
| Pre-training | Learn language and world patterns from huge text | Foundation model lab / vendor |
| Fine-tuning | Specialize behavior on a smaller, curated dataset | Vendor or your ML platform team |
| RLHF | Optimize for human-preferred answers at scale | Vendor alignment team |
| Product layers | Prompt, RAG, tools, policies, UI, human review | Your product + engineering |
When a user says “the model is wrong,” ask which layer failed: missing knowledge (pre-training + retrieval), wrong format (fine-tuning or prompting), unhelpful tone (RLHF or prompting), or missing live data (tools, not training).
PM takeaway
Training stages set the ceiling. Product layers determine what users actually experience in production.
2. Pre-training: Building the Base Model
Pre-training is the first large training pass. The model reads enormous amounts of text (web, books, code, etc.) and learns to predict the next token. That objective teaches grammar, reasoning patterns, style, and a wide slice of factual associations.
The output is often called a base model or foundation model. It is powerful but not yet tuned for your chat UI, brand, or workflow.
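
The next-token objective described above can be made concrete with a tiny sketch. It computes the standard cross-entropy penalty for a single prediction: the less probability the model gave to the token that actually came next, the larger the loss. The vocabulary and probabilities are invented for illustration and do not come from any real model.

```python
import math


def next_token_loss(predicted_probs, actual_next_token):
    """Cross-entropy for one prediction: -log(probability of the true next token)."""
    return -math.log(predicted_probs[actual_next_token])


# Toy context: "The claim was", where the true next token is "denied".
# Both probability tables are made-up illustrations.
favors_right_token = {"denied": 0.70, "approved": 0.25, "banana": 0.05}
favors_wrong_token = {"denied": 0.10, "approved": 0.10, "banana": 0.80}

low_loss = next_token_loss(favors_right_token, "denied")
high_loss = next_token_loss(favors_wrong_token, "denied")

print(f"loss when the true token is likely:   {low_loss:.2f}")
print(f"loss when the true token is unlikely: {high_loss:.2f}")
```

Pre-training repeats this kind of update across the whole corpus, which is why the result is broad fluency rather than knowledge of any one company's workflow.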
PM analogy
Pre-training is like hiring a brilliant generalist who has read the internet — but has never worked in your company, never used your forms, and does not know your escalation rules.
| Property | Typical scale |
|---|---|
| Data | Trillions of tokens |
| Compute | Very large clusters, weeks to months |
| Cost | Millions of dollars for frontier models |
| PM access | Usually none — you pick a vendor model |
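
To give the scale in the table a rough shape, here is a back-of-envelope sketch. It relies on a widely used rule of thumb that training a dense transformer takes about 6 floating-point operations per parameter per training token. The model size, token count, per-GPU throughput, utilization, and cluster size are all illustrative assumptions, not figures from any vendor.

```python
def training_flops(num_parameters, num_tokens):
    """Rule-of-thumb training compute: roughly 6 FLOPs per parameter per token."""
    return 6 * num_parameters * num_tokens


def gpu_days(total_flops, flops_per_gpu_per_second, utilization):
    """Convert total compute into GPU-days at an assumed sustained throughput."""
    seconds = total_flops / (flops_per_gpu_per_second * utilization)
    return seconds / 86_400  # seconds in one day


# Assumed: a 70B-parameter model trained on 2 trillion tokens.
flops = training_flops(num_parameters=70e9, num_tokens=2e12)

# Assumed: 3e14 peak FLOP/s per accelerator at 40% utilization.
gpu_days_needed = gpu_days(flops, flops_per_gpu_per_second=3e14, utilization=0.4)

# Assumed: a 1,000-GPU cluster running the whole time.
calendar_days = gpu_days_needed / 1000

print(f"total compute: {flops:.1e} FLOPs")
print(f"GPU-days: {gpu_days_needed:,.0f}")
print(f"calendar days on 1,000 GPUs: {calendar_days:.0f}")
```

Under these assumptions the run takes months even on a large cluster, which is the practical reason most product teams select a vendor model instead of pre-training.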
Most product teams do not pre-train. They choose GPT, Claude, Gemini, Llama, or an enterprise-hosted variant. Your leverage is model selection, version pinning, and everything above the base weights.
3. What Pre-training Is Good For
Pre-training establishes the raw capabilities your product inherits:
- broad language fluency across domains,
- general reasoning and explanation ability,
- coding, math, and structured text patterns,
- multilingual coverage (varies by model),
- world knowledge frozen at a training cutoff date.
| Product need | Pre-training contribution |
|---|---|
| Draft a clear paragraph | Strong — core language skill |
| Explain insurance concepts in plain English | Strong — if concepts were in training data |
| Know today’s claim status | Weak — needs tools or RAG |
| Follow your JSON schema every time | Variable — often needs fine-tuning or strict prompting |
PM takeaway
Pre-training is why the model feels “smart.” It is not why the model feels “on brand” or “connected to your systems.”
4. Pre-training Limits PMs Should Plan For
Base models come with structural limits that later stages only partially fix:
| Limit | What users experience | Product response |
|---|---|---|
| Knowledge cutoff | Outdated facts, wrong “current” events | RAG, tools, disclaimers — see Chapter 11 |
| No private data | Cannot know your member’s claim by default | Retrieval + authenticated APIs |
| Completion bias | Keeps writing instead of answering | Instruction tuning (RLHF/SFT) + prompts |
| Hallucination risk | Fluent guesses when uncertain | Grounding, citations, human review — Chapter 8 |
| Unsafe completions | Harmful or non-compliant text | RLHF, filters, policy layers — Chapters 4–5 |
Upgrading from “8B base” to “70B base” can help quality, but it does not replace product architecture. Bigger pre-trained models still need alignment, context, and verification for enterprise workflows.
5. Fine-tuning: Specializing Behavior
Fine-tuning continues training on a smaller, curated dataset so the model adapts to a task, domain tone, or output structure. Weights change — but far less than in pre-training.
Common types product teams encounter:
- Supervised fine-tuning (SFT) — learn from example input/output pairs,
- Domain fine-tunes — legal, medical, code, enterprise vocabulary,
- Adapter / LoRA fine-tunes — cheaper partial updates on top of a frozen base,
- Distillation — train a smaller model to mimic a larger one (cost/latency play).
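
To illustrate why adapter-style fine-tunes are cheaper, the sketch below counts trainable parameters for a single weight matrix. It assumes the usual LoRA construction, where the frozen matrix of shape d_out x d_in gets a low-rank update made of two small matrices of shapes d_out x rank and rank x d_in. The layer size and rank are illustrative assumptions.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update: (d_out x rank) + (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed: one 4096 x 4096 projection matrix and a LoRA rank of 8.
full = full_update_params(4096, 4096)
lora = lora_update_params(4096, 4096, rank=8)

print(f"full fine-tune: {full:,} trainable parameters")
print(f"LoRA (rank 8):  {lora:,} trainable parameters")
print(f"LoRA share of full update: {lora / full:.2%}")
```

The base weights stay frozen, so the same base model can serve several adapters. Whether the quality matches a full fine-tune depends on the task and needs its own evals.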
Chapter 6 focused on when your team should fine-tune vs prompt. This chapter focuses on what fine-tuning is in the vendor stack — and how it differs from RLHF.
Key distinction
Fine-tuning teaches patterns from examples. It is not a reliable way to inject fast-changing facts at scale.
6. What Fine-tuning Is Good For
| Use case | Why fine-tuning helps |
|---|---|
| Stable output schema | Model internalizes JSON / field layout |
| Consistent adjudication wording | Repeats approved phrasing patterns |
| Domain classification | Learns labels from historical examples |
| Brand voice on short replies | Tone becomes default, not re-prompted every call |
| Lower latency via smaller specialist model | Distilled or fine-tuned mini model for one task |
Fine-tuning shines when the task is repeated, measurable, and you have high-quality labeled data. If SMEs can write 500 excellent examples and evals are clear, fine-tuning may beat a giant prompt.
PM takeaway
Fine-tune for behavior and format stability — not as a substitute for a knowledge base or live APIs.
7. Fine-tuning Limits and Risks
| Risk | What goes wrong |
|---|---|
| Stale knowledge baked in | Model confidently cites old policy version |
| Small bad dataset | Learns wrong shortcuts (“always deny”) |
| Overfitting | Great on eval set, brittle in production |
| Regression on general tasks | Specialist becomes worse outside niche |
| Operational cost | Retrain cycles, versioning, monitoring overhead |
| Compliance drift | Examples disagree with current legal text |
Chapter 8 called out a common hallucination driver: using fine-tuning to add facts. Facts belong in retrieval or systems of record; fine-tuning should teach how to use evidence, not replace evidence.
If policy changes monthly, prefer RAG + citation UI over retraining monthly — unless you have a rigorous MLOps pipeline and legal sign-off on each release.
8. RLHF: Learning What Humans Prefer
RLHF (Reinforcement Learning from Human Feedback) nudges the model toward responses that humans rank higher. It typically follows demonstration fine-tuning and runs in three steps: humans write ideal answers for supervised fine-tuning; humans rank alternative model outputs, and a reward model learns to predict those rankings; then reinforcement learning (often PPO) optimizes the model against that reward.
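
The reward-model step can be sketched with the pairwise loss commonly used for preference data: the loss is small when the reward model scores the human-preferred answer well above the rejected one, and large when it gets the order wrong. This is a minimal sketch of that common formulation, not a description of any specific vendor's training code, and the reward scores are invented.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is the same quantity as log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Made-up reward scores for one comparison labeled by a human rater.
agrees_with_rater = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
undecided = preference_loss(reward_chosen=1.0, reward_rejected=1.0)
disagrees_with_rater = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(f"reward model agrees with rater:    {agrees_with_rater:.3f}")
print(f"reward model is undecided:         {undecided:.3f}")
print(f"reward model disagrees with rater: {disagrees_with_rater:.3f}")
```

Note what the loss rewards: agreement with rater preference. Nothing in it checks whether the preferred answer is factually correct.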
Chapter 5 mapped the InstructGPT pipeline in detail. Here is the PM-level purpose: make the model act like an assistant users want to talk to — helpful, clear, appropriately cautious, and aligned with policy.
| RLHF optimizes for | RLHF does not primarily add |
|---|---|
| Instruction following | Your proprietary database records |
| Preferred tone and structure | Guaranteed factual correctness |
| Reduced toxicity / policy violations | Real-time inventory counts |
| Refusal behavior when appropriate | Perfect recall of 200-page PDFs without context |
9. What RLHF Is Good For
- Chat UX that feels responsive to user intent,
- fewer blatantly unhelpful or rambling answers,
- better refusals on unsafe or out-of-scope requests,
- answers that win side-by-side preference tests,
- behavior that matches vendor safety policies.
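InstructGPT’s lesson for PMs still applies: a smaller RLHF-tuned model can beat a much larger raw base model on real user prompts. In the InstructGPT paper, human labelers preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3. Alignment training can matter more than parameter count for perceived quality.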
PM takeaway
When users say “ChatGPT feels better than a raw base model,” they are often describing RLHF + product layers, not pre-training alone.
10. RLHF Limits PMs Should Not Ignore
| Limit | Product symptom |
|---|---|
| Helpfulness vs truth tension | Model answers instead of saying “I don’t know” |
| Preference bias | Verbose, generic answers that score well with raters |
| Over-refusal | Blocks legitimate business tasks |
| Opaque tradeoffs | Hard to predict change across model versions |
| Not your preferences | Vendor alignment ≠ your brand or jurisdiction |
RLHF reduces some failure modes; it does not remove hallucination. Pair vendor alignment with your evals, grounding, and human review — especially in regulated workflows.
See Chapter 4 for safety framing and Chapter 5 for pipeline detail.
11. Side-by-Side: Pre-training vs Fine-tuning vs RLHF
| Dimension | Pre-training | Fine-tuning | RLHF |
|---|---|---|---|
| Primary goal | General capability | Specialized behavior | Human-preferred behavior |
| Data scale | Massive, noisy web-scale | Smaller, curated | Rankings + demonstrations |
| Typical owner | Foundation lab | Lab or enterprise ML | Lab alignment team |
| Changes weights? | Full model from scratch | Partial or full update | Further update on top |
| Best for facts? | Broad, dated associations | Risky for volatile facts | Not designed for facts |
| Best for tone? | Imitates internet average | Can encode brand patterns | Optimizes “sounds helpful” |
| PM iteration speed | Slow (vendor releases) | Medium (retrain cycles) | Slow (vendor releases) |
Stack order: Pre-train → Fine-tune (often SFT) → RLHF → Your product (prompt, RAG, tools, UI).
12. Where Prompting Fits (After Training)
Prompting does not change model weights. It is still one of the highest-leverage layers for product teams — especially because you can ship prompt changes in hours, not months.
| Problem | Try first | Training-stage note |
|---|---|---|
| Need today’s policy clause | RAG + citation | Pre-training cutoff is irrelevant if retrieval is right |
| Need strict JSON every time | Schema + prompt; then fine-tune if still flaky | Fine-tuning helps when prompts max out |
| Model too verbose | Prompt + sampling settings | RLHF may already bias toward long answers |
| Wrong tool choice in agents | Tool design + context engineering | Chapter 9 — not solved by bigger base model |
| Enterprise tone | System prompt; fine-tune if stable across thousands of calls | RLHF sets vendor default, not yours |
Chapter 6 is the deep dive on prompting vs fine-tuning vs RAG vs tools. This chapter’s job is to clarify what the vendor already baked in before your prompt runs.
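
The "schema + prompt first" row can be sketched as a validate-and-retry loop around the model call. Everything named here is hypothetical: the three-field schema, the `call_model` callable, and the `fake_model` stand-in that replaces a real vendor client so the example runs on its own.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"claim_id": str, "decision": str, "reason_code": str}


def parse_and_validate(raw_text):
    """Return the parsed dict if it is valid JSON matching the schema, else None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field_name), expected_type):
            return None
    return data


def get_structured_output(call_model, prompt, max_attempts=3):
    """Call the model until its output validates; also return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        result = parse_and_validate(call_model(prompt))
        if result is not None:
            return result, attempt
    return None, max_attempts


# Stand-in for a real model client: chatty text first, valid JSON second.
_canned = iter([
    "Sure! Here is the JSON you asked for...",
    '{"claim_id": "C-1", "decision": "review", "reason_code": "R12"}',
])


def fake_model(prompt):
    return next(_canned)


result, attempts = get_structured_output(fake_model, "Extract the claim fields as JSON.")
print(result, "after", attempts, "attempts")
```

Logging the attempt count gives a measurable signal: if the retry rate stays high after prompt and schema work, that is evidence the prompt has maxed out and fine-tuning may be worth scoping.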
13. Practical Example: Claims Adjudication Assistant
Pre-training layer: Model understands medical terms, insurance language, and how to summarize documents.
Fine-tuning layer (optional): Internal fine-tune on historical adjudication notes so JSON shortfall codes and phrasing match your operations manual.
RLHF layer (vendor): Model follows instructions, refuses obviously harmful requests, and sounds helpful in chat.
Product layers (you):
- RAG for current policy PDF and checklist,
- tool calls for claim status and member eligibility,
- system prompt for escalation and “no final denial without human,”
- low-temperature extraction subtasks,
- human review gate before customer-facing output.
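
The product layers listed above might fit together roughly as follows. This is a sketch under stated assumptions, not a reference architecture: `retrieve_policy`, `get_claim_status`, and `fake_model` are hypothetical stand-ins for a retrieval service, an authenticated claims API, and a vendor model client, and the policy text is invented.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    citations: list
    needs_human_review: bool


def retrieve_policy(question):
    # Stand-in for RAG over the current policy PDF (invented content).
    return [{"id": "policy-sec-4.2", "text": "Physical therapy is limited to 20 visits per year."}]


def get_claim_status(claim_id):
    # Stand-in for an authenticated tool call to the claims system.
    return {"claim_id": claim_id, "status": "pending", "visits_used": 22}


def build_prompt(question, passages, claim):
    evidence = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer only from the evidence and claim data below. Cite passage ids.\n"
        "Never issue a final denial; escalate to a human reviewer instead.\n"
        f"Evidence:\n{evidence}\nClaim data: {claim}\nQuestion: {question}"
    )


def answer(question, claim_id, call_model):
    passages = retrieve_policy(question)      # fresh facts come from retrieval
    claim = get_claim_status(claim_id)        # live data comes from a tool call
    prompt = build_prompt(question, passages, claim)
    text = call_model(prompt, temperature=0.0)  # low temperature for extraction-style work
    # Review gate: in this sketch every customer-facing draft goes to a human.
    return Draft(text=text, citations=[p["id"] for p in passages], needs_human_review=True)


def fake_model(prompt, temperature):
    return "Visit limit appears exceeded [policy-sec-4.2]. Escalating for human review."


draft = answer("Is this claim payable?", "C-1", fake_model)
print(draft)
```

None of these steps touches model weights, which is why they can change on a product release cycle rather than a training cycle.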
PM takeaway
Do not fine-tune the model on policy PDFs if RAG plus citation is safer. Fine-tune on behavior when format and tone must be identical every time.
14. Practical Example: Customer Support Copilot
Pre-training layer: Drafts empathetic replies, understands product categories, multilingual support (model-dependent).
Fine-tuning layer (optional): Train on redacted ticket history so replies match macros, tagging, and resolution steps.
RLHF layer (vendor): Polite, instruction-following chat behavior; baseline safety refusals.
Product layers (you):
- retrieve help-center articles and account facts at answer time,
- inject CRM context in the prompt,
- force citations for refund policy claims,
- agent UI with edit-before-send,
- eval suite on real ticket types weekly.
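
A weekly eval suite is more useful when each test case is tagged with the layer it exercises, so a drop in quality points at retrieval, format, or tone rather than at "the model." The sketch below assumes a hypothetical results list; the layer names and outcomes are invented.

```python
from collections import defaultdict

# Hypothetical eval results: each golden case is tagged with the layer it tests.
results = [
    {"layer": "retrieval", "passed": True},
    {"layer": "retrieval", "passed": False},
    {"layer": "format", "passed": True},
    {"layer": "format", "passed": True},
    {"layer": "tone", "passed": True},
    {"layer": "tone", "passed": False},
    {"layer": "tone", "passed": False},
]


def pass_rate_by_layer(results):
    """Return {layer: fraction of cases passed} for the tagged results."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for case in results:
        totals[case["layer"]] += 1
        passes[case["layer"]] += int(case["passed"])
    return {layer: passes[layer] / totals[layer] for layer in totals}


rates = pass_rate_by_layer(results)
for layer, rate in sorted(rates.items()):
    print(f"{layer}: {rate:.0%}")
```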
Support PMs often ask for “fine-tune on all tickets.” Engineering should ask: is the gap knowledge (retrieval), behavior (fine-tune), or preference (prompt + RLHF already there)?
15. Decision Framework: Five Questions for Roadmap Reviews
- What failure are we seeing? — Knowledge, format, tone, safety, or missing live data.
- Is it repeated at scale? — One-off fixes belong in prompts; repeated patterns may justify fine-tuning.
- Do we own the data? — Fine-tuning and evals need rights-cleared, representative examples.
- How fast does truth change? — Volatile facts → RAG/tools; stable behavior → fine-tune.
- What does the vendor already provide? — New base + RLHF release may solve it without custom training.
| If the gap is… | Lean toward… |
|---|---|
| Wrong or missing facts | RAG, tools, verification (Ch 8, 11) |
| Wrong JSON / labels every time | Prompt + schema; then fine-tune |
| Rude or off-policy chat | Prompt + vendor model upgrade; rarely custom RLHF |
| Too expensive at volume | Smaller model, distillation, task routing |
| Context overload in agents | Context design (Ch 9), not bigger pre-train |
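
The five questions and the table above can be written down as a small lookup, which is sometimes handy for triage templates. This is one possible encoding, offered as a sketch: the gap labels and the rule that fine-tuning needs scale, rights-cleared data, and stable truth are a simplification of the framework, and the function names are hypothetical.

```python
# Gap label -> recommendation, copied from the table above.
LEAN_TOWARD = {
    "facts": "RAG, tools, verification",
    "format": "Prompt + schema; then fine-tune",
    "tone": "Prompt + vendor model upgrade; rarely custom RLHF",
    "cost": "Smaller model, distillation, task routing",
    "context": "Context design, not bigger pre-train",
}


def fine_tuning_is_plausible(repeated_at_scale, own_cleared_data, facts_change_often):
    """Questions 2 to 4: fine-tuning needs scale, usable data, and stable truth."""
    return repeated_at_scale and own_cleared_data and not facts_change_often


def recommend(gap, repeated_at_scale=False, own_cleared_data=False, facts_change_often=False):
    if gap not in LEAN_TOWARD:
        raise ValueError(f"unknown gap: {gap}")
    if gap == "format" and not fine_tuning_is_plausible(
        repeated_at_scale, own_cleared_data, facts_change_often
    ):
        return "Prompt + schema (fine-tuning not yet justified)"
    return LEAN_TOWARD[gap]


print(recommend("facts"))
print(recommend("format"))
print(recommend("format", repeated_at_scale=True, own_cleared_data=True))
```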
16. Common PM Mistakes
| Mistake | Why it fails |
|---|---|
| “We’ll fine-tune on the policy wiki” | Stale facts; hallucination risk |
| “RLHF means the model is truthful” | Preference ≠ verification |
| “Bigger pre-train fixes our agent” | Context and tools may be the bottleneck |
| “Prompting is temporary, fine-tune is real” | Prompts are a permanent product surface |
| “We need custom RLHF” | Extremely expensive; vendor defaults first |
| No version pin on model family | Surprise behavior shifts after vendor update |
| Same strategy for MVP and compliance tier | High-risk flows need stricter grounding |
17. PRD Layer Map: What to Specify
When you write requirements, name the layer explicitly so engineering and legal align:
| PRD section | Specify |
|---|---|
| Model baseline | Vendor, model ID, version, knowledge cutoff |
| Training assumptions | “Uses vendor RLHF chat model” vs “custom fine-tune v3” |
| Knowledge | RAG sources, refresh SLA, citation required Y/N |
| Behavior | Output schema, refusal rules, escalation paths |
| Evals | Golden sets per layer — not one vague “accuracy” metric |
| Change management | What happens when base model or fine-tune version bumps |
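
One way to keep the layer map from drifting is to store it as structured data next to the PRD, so a missing field is visible in review. The sketch below is a hypothetical shape, not a standard: the class names, field names, vendor, model ID, and cutoff are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelBaseline:
    vendor: str
    model_id: str          # pin an exact dated version, not a moving alias
    knowledge_cutoff: str


@dataclass(frozen=True)
class LayerSpec:
    baseline: ModelBaseline
    training_assumption: str   # e.g. vendor chat model vs custom fine-tune
    rag_sources: tuple
    refresh_sla_hours: int
    citation_required: bool
    output_schema: str
    golden_sets: dict = field(default_factory=dict)  # layer name -> eval set name

    def missing_eval_layers(self):
        """List layers that have no golden set yet (expected layers are assumed)."""
        expected = {"retrieval", "prompt", "fine_tune"}
        return sorted(expected - set(self.golden_sets))


spec = LayerSpec(
    baseline=ModelBaseline("ExampleVendor", "example-chat-2024-01-01", "2023-10"),
    training_assumption="Uses vendor RLHF chat model",
    rag_sources=("policy_pdfs", "help_center"),
    refresh_sla_hours=24,
    citation_required=True,
    output_schema="claim_summary_v1",
    golden_sets={"retrieval": "golden_retrieval_v2", "prompt": "golden_prompt_v5"},
)

print(spec.missing_eval_layers())
```

A check like `missing_eval_layers` turns "golden sets per layer" from a PRD sentence into something a reviewer can fail.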
PM takeaway
Ambiguous “improve the model” tickets become expensive. Split work by layer: retrieval bug vs prompt vs fine-tune retrain.
18. The PM Mental Model
Pre-training gives capability.
Fine-tuning gives specialization.
RLHF gives preferred assistant behavior.
Your product gives truth in context.
| Layer | Your main PM question |
|---|---|
| Pre-training | Which foundation model and cutoff fit our domain? |
| Fine-tuning | Do we have stable examples worth baking into weights? |
| RLHF | Does the vendor chat model meet safety and UX bar? |
| Prompt + context | What must the model see for this request? (Ch 9) |
| RAG + tools | Where do fresh facts live? (Ch 6, 8) |
| Verification | What requires a human or automated check before shipping? |
You rarely control pre-training or RLHF. You always control what happens after the model loads.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Pre-training | Builds general language and knowledge; vendor-scale |
| Fine-tuning | Specializes behavior; risky for volatile facts |
| RLHF | Aligns to human-preferred assistant behavior |
| Stack order | Pre-train → fine-tune → RLHF → product layers |
| Prompting | Runtime lever; complements all training stages |
| Claims / support patterns | RAG + tools for facts; fine-tune for stable format |
| PM role | Name the layer in PRDs, evals, and incident reviews |
Closing Thought
Vendor slides blur the stack into one word: “GPT-4” or “Claude.” Your job is to unblur it for the team. When quality shifts after a model upgrade, ask whether pre-training, alignment, or your retrieval layer moved.
The organizations that ship reliable AI products treat training stages as constraints and product layers as craft — prompts, context, tools, evals, and review — built on top of whatever foundation the lab provides.
The next chapter covers hallucinations, knowledge cutoffs, and model limitations — what happens when the stack’s knowledge is wrong, missing, or out of date, and how PMs design around that reality.
The real PM lesson
Training builds the assistant. Your product decides whether users can trust it.