Chapter 10 · Module 01 · Beginner–Intermediate · 22–26 min

Chapter 10: Pre-training vs Fine-tuning vs RLHF — The PM Version

How pre-training, fine-tuning, and RLHF turn a raw language model into a useful assistant.

Book: AI Learning
Diagram: Pre-train → Fine-tune → RLHF → Product

Training builds capability; your product builds trust

Introduction

In Chapter 5, we walked through InstructGPT and the RLHF pipeline — how human feedback turns a base model into an instruction-following assistant.

In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the product layers you control after a vendor ships a model.

In Chapter 8, we covered hallucination — why fluent text is not proof, and how grounding and verification reduce risk.

In Chapter 9, we covered long context — capacity vs focus, and why bigger windows do not replace context design.

Now we zoom out to the training stack that happens before your product ever sends a prompt:

Pre-training, fine-tuning, and RLHF — what each stage actually changes.

PMs hear these terms in roadmap reviews, vendor decks, and compliance conversations. They are often lumped together as “the model.” In practice they solve different problems: general capability, specialized behavior, and human-preferred behavior.

This chapter is not a machine learning course. It is a map of what you are buying, what you can still change in product, and what mistakes to avoid when you scope AI features.

The simple PM version

Pre-training teaches language and broad knowledge.
Fine-tuning teaches patterns and formats.
RLHF teaches what humans prefer.
Your product still owns prompts, context, tools, and verification.

1. The Simple Comparison: Three Training Layers

Think of a modern assistant as a stack. Each training layer is expensive, slow to change, and usually owned by a different team; the product layer on top is the one you can change quickly.

| Stage | One-line PM meaning | Who usually owns it |
|---|---|---|
| Pre-training | Learn language and world patterns from huge text | Foundation model lab / vendor |
| Fine-tuning | Specialize behavior on a smaller, curated dataset | Vendor or your ML platform team |
| RLHF | Optimize for human-preferred answers at scale | Vendor alignment team |
| Product layers | Prompt, RAG, tools, policies, UI, human review | Your product + engineering |

When a user says “the model is wrong,” ask which layer failed: missing knowledge (pre-training + retrieval), wrong format (fine-tuning or prompting), unhelpful tone (RLHF or prompting), or missing live data (tools, not training).

PM takeaway

Training stages set the ceiling. Product layers determine what users actually experience in production.

2. Pre-training: Building the Base Model

Pre-training is the first large training pass. The model reads enormous amounts of text (web, books, code, etc.) and learns to predict the next token. That objective teaches grammar, reasoning patterns, style, and a wide slice of factual associations.
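To make the next-token objective concrete, here is a toy sketch in Python. It is not how real models are trained (they use neural networks over vastly more text); it only counts which word follows which in a tiny made-up corpus and predicts the most frequent continuation. The corpus and function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration; real pre-training uses trillions of tokens.
corpus = "the claim was approved . the claim was denied . the claim was approved ."

def train_bigram(text):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent next word and its estimated probability."""
    followers = counts[word]
    total = sum(followers.values())
    best, n = followers.most_common(1)[0]
    return best, n / total

model = train_bigram(corpus)
print(predict_next(model, "was"))  # "approved" follows "was" in 2 of 3 cases
```

Even this toy version shows the PM-relevant point: the model predicts what usually comes next in its training text, which is not the same as knowing what is true for a specific claim today.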

The output is often called a base model or foundation model. It is powerful but not yet tuned for your chat UI, brand, or workflow.

PM analogy

Pre-training is like hiring a brilliant generalist who has read the internet — but has never worked in your company, never used your forms, and does not know your escalation rules.

| Property | Typical scale |
|---|---|
| Data | Trillions of tokens |
| Compute | Very large clusters, weeks to months |
| Cost | Millions of dollars for frontier models |
| PM access | Usually none; you pick a vendor model |

Most product teams do not pre-train. They choose GPT, Claude, Gemini, Llama, or an enterprise-hosted variant. Your leverage is model selection, version pinning, and everything above the base weights.

3. What Pre-training Is Good For

Pre-training establishes the raw capabilities your product inherits:

  • broad language fluency across domains,
  • general reasoning and explanation ability,
  • coding, math, and structured text patterns,
  • multilingual coverage (varies by model),
  • world knowledge frozen at a training cutoff date.

| Product need | Pre-training contribution |
|---|---|
| Draft a clear paragraph | Strong: core language skill |
| Explain insurance concepts in plain English | Strong, if concepts were in training data |
| Know today’s claim status | Weak: needs tools or RAG |
| Follow your JSON schema every time | Variable: often needs fine-tuning or strict prompting |

PM takeaway

Pre-training is why the model feels “smart.” It is not why the model feels “on brand” or “connected to your systems.”

4. Pre-training Limits PMs Should Plan For

Base models come with structural limits that later stages only partially fix:

| Limit | What users experience | Product response |
|---|---|---|
| Knowledge cutoff | Outdated facts, wrong “current” events | RAG, tools, disclaimers (see Chapter 11) |
| No private data | Cannot know your member’s claim by default | Retrieval + authenticated APIs |
| Completion bias | Keeps writing instead of answering | Instruction tuning (RLHF/SFT) + prompts |
| Hallucination risk | Fluent guesses when uncertain | Grounding, citations, human review (Chapter 8) |
| Unsafe completions | Harmful or non-compliant text | RLHF, filters, policy layers (Chapters 4–5) |

Upgrading from “8B base” to “70B base” can help quality, but it does not replace product architecture. Bigger pre-trained models still need alignment, context, and verification for enterprise workflows.

5. Fine-tuning: Specializing Behavior

Fine-tuning continues training on a smaller, curated dataset so the model adapts to a task, domain tone, or output structure. Weights change — but far less than in pre-training.

Common types product teams encounter:

  • Supervised fine-tuning (SFT) — learn from example input/output pairs,
  • Domain fine-tunes — legal, medical, code, enterprise vocabulary,
  • Adapter / LoRA fine-tunes — cheaper partial updates on top of a frozen base,
  • Distillation — train a smaller model to mimic a larger one (cost/latency play).
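To see why adapter / LoRA fine-tunes are the cheaper option in the list above, a rough arithmetic sketch helps. LoRA keeps a weight matrix of shape d x k frozen and trains two small matrices, B (d x r) and A (r x k), whose product is added to the frozen weights. The sizes below are illustrative assumptions, not taken from any specific model.

```python
def full_update_params(d, k):
    """Trainable values if the whole d x k weight matrix is updated."""
    return d * k

def lora_update_params(d, k, r):
    """Trainable values for a rank-r adapter: B is d x r, A is r x k."""
    return r * (d + k)

d, k, r = 4096, 4096, 8  # illustrative sizes, not tied to any specific model
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(full, lora, f"{lora / full:.2%}")
```

With these assumed sizes the adapter trains well under 1% of the values a full update would touch for that matrix, which is where the savings in training cost and storage come from.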

Chapter 6 focused on when your team should fine-tune vs prompt. This chapter focuses on what fine-tuning is in the vendor stack — and how it differs from RLHF.

Key distinction

Fine-tuning teaches patterns from examples. It is not a reliable way to inject fast-changing facts at scale.

6. What Fine-tuning Is Good For

| Use case | Why fine-tuning helps |
|---|---|
| Stable output schema | Model internalizes JSON / field layout |
| Consistent adjudication wording | Repeats approved phrasing patterns |
| Domain classification | Learns labels from historical examples |
| Brand voice on short replies | Tone becomes default, not re-prompted every call |
| Lower latency via smaller specialist model | Distilled or fine-tuned mini model for one task |

Fine-tuning shines when the task is repeated, measurable, and you have high-quality labeled data. If SMEs can write 500 excellent examples and evals are clear, fine-tuning may beat a giant prompt.

PM takeaway

Fine-tune for behavior and format stability — not as a substitute for a knowledge base or live APIs.

7. Fine-tuning Limits and Risks

| Risk | What goes wrong |
|---|---|
| Stale knowledge baked in | Model confidently cites old policy version |
| Small bad dataset | Learns wrong shortcuts (“always deny”) |
| Overfitting | Great on eval set, brittle in production |
| Regression on general tasks | Specialist becomes worse outside niche |
| Operational cost | Retrain cycles, versioning, monitoring overhead |
| Compliance drift | Examples disagree with current legal text |

Chapter 8 called out a common hallucination driver: using fine-tuning to add facts. Facts belong in retrieval or systems of record; fine-tuning should teach how to use evidence, not replace evidence.

If policy changes monthly, prefer RAG + citation UI over retraining monthly — unless you have a rigorous MLOps pipeline and legal sign-off on each release.

8. RLHF: Learning What Humans Prefer

RLHF (Reinforcement Learning from Human Feedback) nudges the model toward responses humans rank higher. It typically follows supervised fine-tuning on human-written demonstrations of ideal answers. Human raters then rank alternative model outputs, a reward model learns those preferences, and reinforcement learning (often PPO) optimizes the model against that reward.
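The "reward model learns preferences" step can be sketched with a pairwise loss of the kind described in the InstructGPT paper: the reward model is penalized when it scores the answer humans rejected above the one they chose. The reward scores below are made-up numbers, and the function name is an assumption for illustration.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen answer scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Reward model agrees with the human ranking: low loss.
print(pairwise_preference_loss(2.0, 0.0))
# Reward model disagrees with the human ranking: high loss.
print(pairwise_preference_loss(0.0, 2.0))
```

Note what the loss measures: agreement with human rankings. Nothing in it checks whether the preferred answer is factually correct, which is why preference tuning is not a substitute for verification.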

Chapter 5 mapped the InstructGPT pipeline in detail. Here is the PM-level purpose: make the model act like an assistant users want to talk to — helpful, clear, appropriately cautious, and aligned with policy.

| RLHF optimizes for | RLHF does not primarily add |
|---|---|
| Instruction following | Your proprietary database records |
| Preferred tone and structure | Guaranteed factual correctness |
| Reduced toxicity / policy violations | Real-time inventory counts |
| Refusal behavior when appropriate | Perfect recall of 200-page PDFs without context |

9. What RLHF Is Good For

  • Chat UX that feels responsive to user intent,
  • fewer blatantly unhelpful or rambling answers,
  • better refusals on unsafe or out-of-scope requests,
  • answers that win side-by-side preference tests,
  • behavior that matches vendor safety policies.

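InstructGPT’s lesson for PMs still applies: a smaller RLHF-tuned model can beat a much larger raw base model on real user prompts. In the InstructGPT paper, human raters preferred outputs from the 1.3B-parameter aligned model over those from the 175B-parameter GPT-3. Alignment training can matter more than parameter count for perceived quality.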

PM takeaway

When users say “GPT feels better than the raw API,” they are often describing RLHF + product layers — not pre-training alone.

10. RLHF Limits PMs Should Not Ignore

| Limit | Product symptom |
|---|---|
| Helpfulness vs truth tension | Model answers instead of saying “I don’t know” |
| Preference bias | Verbose, generic answers that score well with raters |
| Over-refusal | Blocks legitimate business tasks |
| Opaque tradeoffs | Hard to predict change across model versions |
| Not your preferences | Vendor alignment ≠ your brand or jurisdiction |

RLHF reduces some failure modes; it does not remove hallucination. Pair vendor alignment with your evals, grounding, and human review — especially in regulated workflows.

See Chapter 4 for safety framing and Chapter 5 for pipeline detail.

11. Side-by-Side: Pre-training vs Fine-tuning vs RLHF

| Dimension | Pre-training | Fine-tuning | RLHF |
|---|---|---|---|
| Primary goal | General capability | Specialized behavior | Human-preferred behavior |
| Data scale | Massive, noisy web-scale | Smaller, curated | Rankings + demonstrations |
| Typical owner | Foundation lab | Lab or enterprise ML | Lab alignment team |
| Changes weights? | Full model from scratch | Partial or full update | Further update on top |
| Best for facts? | Broad, dated associations | Risky for volatile facts | Not designed for facts |
| Best for tone? | Imitates internet average | Can encode brand patterns | Optimizes “sounds helpful” |
| PM iteration speed | Slow (vendor releases) | Medium (retrain cycles) | Slow (vendor releases) |

Stack order: Pre-train → Fine-tune (often SFT) → RLHF → Your product (prompt, RAG, tools, UI).

12. Where Prompting Fits (After Training)

Prompting does not change model weights. It is still one of the highest-leverage layers for product teams — especially because you can ship prompt changes in hours, not months.

| Problem | Try first | Training-stage note |
|---|---|---|
| Need today’s policy clause | RAG + citation | Pre-training cutoff is irrelevant if retrieval is right |
| Need strict JSON every time | Schema + prompt; then fine-tune if still flaky | Fine-tuning helps when prompts max out |
| Model too verbose | Prompt + sampling settings | RLHF may already bias toward long answers |
| Wrong tool choice in agents | Tool design + context engineering | See Chapter 9; not solved by a bigger base model |
| Enterprise tone | System prompt; fine-tune if stable across thousands of calls | RLHF sets vendor default, not yours |

Chapter 6 is the deep dive on prompting vs fine-tuning vs RAG vs tools. This chapter’s job is to clarify what the vendor already baked in before your prompt runs.
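For the "strict JSON every time" row, the product-layer approach can be sketched as validate-then-retry. This is a minimal sketch under stated assumptions: the field names are a hypothetical schema, and `call_model` stands in for whatever vendor API your team uses (here it is a stub that fails once, then succeeds).

```python
import json

# Hypothetical schema for illustration: field name -> expected type.
REQUIRED_FIELDS = {"claim_id": str, "decision": str, "reason": str}

def parse_and_validate(raw_text):
    """Return the parsed dict if it matches the schema, else None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

def get_structured_output(call_model, prompt, max_attempts=3):
    """Call the model until its output validates; return (data, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        data = parse_and_validate(call_model(prompt))
        if data is not None:
            return data, attempt
    return None, max_attempts

# Stub standing in for a real model call: chatty text first, valid JSON second.
responses = iter([
    "Sure! Here is the JSON you asked for...",
    '{"claim_id": "C-1", "decision": "review", "reason": "missing form"}',
])
data, attempts = get_structured_output(lambda prompt: next(responses), "Extract the claim fields as JSON.")
print(data, attempts)
```

Logging how often retries are needed gives you evidence for the table's second column: if the retry rate stays high after prompt work, that is the signal that prompts have maxed out and fine-tuning is worth scoping.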

13. Practical Example: Claims Adjudication Assistant

Pre-training layer: Model understands medical terms, insurance language, and how to summarize documents.

Fine-tuning layer (optional): Internal fine-tune on historical adjudication notes so JSON shortfall codes and phrasing match your operations manual.

RLHF layer (vendor): Model follows instructions, refuses obvious harm, sounds helpful in chat.

Product layers (you):

  • RAG for current policy PDF and checklist,
  • tool calls for claim status and member eligibility,
  • system prompt for escalation and “no final denial without human,”
  • low-temperature extraction subtasks,
  • human review gate before customer-facing output.
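The product layers listed above can be sketched as one request flow. Everything here is a stand-in: the retrieval function, the claims API, the policy text, and `call_model` are hypothetical placeholders, not a real system or vendor SDK.

```python
def retrieve_policy(question):
    # Stand-in for RAG over the current policy PDF (made-up content).
    return {"text": "Section 4.2: physiotherapy is covered up to 12 sessions.",
            "source": "policy.pdf#4.2"}

def get_claim_status(claim_id):
    # Stand-in for an authenticated claims API (tool call).
    return {"claim_id": claim_id, "status": "pending", "eligible": True}

def build_prompt(question, policy, claim):
    # System rules plus retrieved evidence and live data, assembled per request.
    return (
        "Rules: cite the policy source; never issue a final denial without a human.\n"
        f"Policy [{policy['source']}]: {policy['text']}\n"
        f"Claim data: {claim}\n"
        f"Question: {question}"
    )

def review_flags(draft):
    # Flags that help the human reviewer prioritize; they do not replace review.
    flags = []
    if any(word in draft.lower() for word in ("deny", "denied", "denial")):
        flags.append("possible denial")
    if "[" not in draft:
        flags.append("missing citation")
    return flags

def handle(question, claim_id, call_model):
    policy = retrieve_policy(question)
    claim = get_claim_status(claim_id)
    draft = call_model(build_prompt(question, policy, claim))
    # Every draft waits for a human before it becomes customer-facing.
    return {"draft": draft, "flags": review_flags(draft), "status": "awaiting_human_review"}

fake_model = lambda prompt: "Covered under Section 4.2 [policy.pdf#4.2]."
print(handle("Is physiotherapy covered?", "C-1", fake_model))
```

Notice that none of these steps touches model weights. Facts arrive through retrieval and tools at request time, and the trust decision sits in the review gate.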

PM takeaway

Do not fine-tune the model on policy PDFs if RAG plus citation is safer. Fine-tune on behavior when format and tone must be identical every time.

14. Practical Example: Customer Support Copilot

Pre-training layer: Drafts empathetic replies, understands product categories, multilingual support (model-dependent).

Fine-tuning layer (optional): Train on redacted ticket history so replies match macros, tagging, and resolution steps.

RLHF layer (vendor): Polite, instruction-following chat behavior; baseline safety refusals.

Product layers (you):

  • retrieve help-center articles and account facts at answer time,
  • inject CRM context in the prompt,
  • force citations for refund policy claims,
  • agent UI with edit-before-send,
  • eval suite on real ticket types weekly.
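The weekly eval suite in the last bullet can start very small. The sketch below scores a copilot on a golden set grouped by ticket type; the tickets, the required phrases, and `fake_copilot` are made-up placeholders, and a phrase check is only one of many possible pass criteria.

```python
def run_evals(cases, generate):
    """Return the pass rate per ticket type for a list of golden cases."""
    totals, passed = {}, {}
    for case in cases:
        kind = case["ticket_type"]
        totals[kind] = totals.get(kind, 0) + 1
        reply = generate(case["ticket"]).lower()
        if all(phrase.lower() in reply for phrase in case["must_include"]):
            passed[kind] = passed.get(kind, 0) + 1
    return {kind: passed.get(kind, 0) / totals[kind] for kind in totals}

# Hypothetical golden set; real cases would come from redacted tickets.
golden = [
    {"ticket_type": "refund", "ticket": "I want my money back", "must_include": ["refund policy"]},
    {"ticket_type": "refund", "ticket": "I was charged twice", "must_include": ["refund policy"]},
    {"ticket_type": "login", "ticket": "I can't sign in", "must_include": ["reset"]},
]

def fake_copilot(ticket):
    # Stub standing in for the real copilot.
    return "Please reset your password." if "sign in" in ticket else "Let me check."

print(run_evals(golden, fake_copilot))
```

Reporting by ticket type matters: a single blended score would hide that refund replies fail here while login replies pass, and those two gaps likely call for different layers.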

Support PMs often ask for “fine-tune on all tickets.” Engineering should ask: is the gap knowledge (retrieval), behavior (fine-tune), or preference (prompt + RLHF already there)?

15. Decision Framework: Five Questions for Roadmap Reviews

  1. What failure are we seeing? — Knowledge, format, tone, safety, or missing live data.
  2. Is it repeated at scale? — One-off fixes belong in prompts; repeated patterns may justify fine-tuning.
  3. Do we own the data? — Fine-tuning and evals need rights-cleared, representative examples.
  4. How fast does truth change? — Volatile facts → RAG/tools; stable behavior → fine-tune.
  5. What does the vendor already provide? — New base + RLHF release may solve it without custom training.

| If the gap is… | Lean toward… |
|---|---|
| Wrong or missing facts | RAG, tools, verification (Ch 8, 11) |
| Wrong JSON / labels every time | Prompt + schema; then fine-tune |
| Rude or off-policy chat | Prompt + vendor model upgrade; rarely custom RLHF |
| Too expensive at volume | Smaller model, distillation, task routing |
| Context overload in agents | Context design (Ch 9), not bigger pre-train |
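Some teams encode this guidance in triage tooling so incident tickets must name a gap before anyone proposes training work. A minimal sketch, assuming hypothetical gap labels of your own choosing:

```python
# Lookup mirroring the "If the gap is / Lean toward" guidance in this section.
# The keys are hypothetical labels; use whatever taxonomy your team agrees on.
FIRST_LEVER = {
    "facts": "RAG, tools, verification",
    "format": "prompt + schema; then fine-tune",
    "tone": "prompt + vendor model upgrade; rarely custom RLHF",
    "cost": "smaller model, distillation, task routing",
    "context": "context design, not a bigger pre-trained model",
}

def first_lever(gap):
    """Suggest where to start for a classified gap; refuse unclassified ones."""
    if gap not in FIRST_LEVER:
        raise ValueError(f"Unclassified gap {gap!r}: answer question 1 first")
    return FIRST_LEVER[gap]

print(first_lever("facts"))
```

The useful part is the error path: a ticket that cannot be classified is not ready for a fine-tuning estimate.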

16. Common PM Mistakes

| Mistake | Why it fails |
|---|---|
| “We’ll fine-tune on the policy wiki” | Stale facts; hallucination risk |
| “RLHF means the model is truthful” | Preference ≠ verification |
| “Bigger pre-train fixes our agent” | Context and tools may be the bottleneck |
| “Prompting is temporary, fine-tune is real” | Prompts are permanent product surface |
| “We need custom RLHF” | Extremely expensive; vendor defaults first |
| No version pin on model family | Surprise behavior shifts after vendor update |
| Same strategy for MVP and compliance tier | High-risk flows need stricter grounding |

17. PRD Layer Map: What to Specify

When you write requirements, name the layer explicitly so engineering and legal align:

| PRD section | Specify |
|---|---|
| Model baseline | Vendor, model ID, version, knowledge cutoff |
| Training assumptions | “Uses vendor RLHF chat model” vs “custom fine-tune v3” |
| Knowledge | RAG sources, refresh SLA, citation required Y/N |
| Behavior | Output schema, refusal rules, escalation paths |
| Evals | Golden sets per layer, not one vague “accuracy” metric |
| Change management | What happens when base model or fine-tune version bumps |
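If engineering keeps the layer map in code or config, an underspecified PRD becomes easy to catch in review. The sketch below is one possible shape, covering only a few rows of the table; the class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LayerSpec:
    vendor: str
    model_id: str              # pin an exact version, not a floating alias
    knowledge_cutoff: str
    training: str              # e.g. "vendor RLHF chat model" or "custom fine-tune v3"
    rag_sources: list = field(default_factory=list)
    citations_required: bool = True

    def missing_fields(self):
        """List fields left empty so reviewers can see what the PRD skipped."""
        return [name for name, value in vars(self).items() if value in ("", [], None)]

# Placeholder values for illustration only.
spec = LayerSpec(
    vendor="ExampleVendor",
    model_id="example-model-v1",
    knowledge_cutoff="",
    training="vendor RLHF chat model",
)
print(spec.missing_fields())
```

In this example the check reports that the knowledge cutoff and RAG sources were never filled in, which are exactly the gaps that tend to surface later as "the model is wrong" tickets.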

PM takeaway

Ambiguous “improve the model” tickets become expensive. Split work by layer: retrieval bug vs prompt vs fine-tune retrain.

18. The PM Mental Model

Pre-training gives capability.
Fine-tuning gives specialization.
RLHF gives preferred assistant behavior.
Your product gives truth in context.

| Layer | Your main PM question |
|---|---|
| Pre-training | Which foundation model and cutoff fit our domain? |
| Fine-tuning | Do we have stable examples worth baking into weights? |
| RLHF | Does the vendor chat model meet our safety and UX bar? |
| Prompt + context | What must the model see for this request? (Ch 9) |
| RAG + tools | Where do fresh facts live? (Ch 6, 8) |
| Verification | What requires a human or automated check before shipping? |

You rarely control pre-training or RLHF. You always control what happens after the model loads.

Chapter Summary

| Concept | PM understanding |
|---|---|
| Pre-training | Builds general language and knowledge; vendor-scale |
| Fine-tuning | Specializes behavior; risky for volatile facts |
| RLHF | Aligns to human-preferred assistant behavior |
| Stack order | Pre-train → fine-tune → RLHF → product layers |
| Prompting | Runtime lever; complements all training stages |
| Claims / support patterns | RAG + tools for facts; fine-tune for stable format |
| PM role | Name the layer in PRDs, evals, and incident reviews |

Closing Thought

Vendor slides blur the stack into one word: “GPT-4” or “Claude.” Your job is to unblur it for the team. When quality shifts after a model upgrade, ask whether pre-training, alignment, or your retrieval layer moved.

The organizations that ship reliable AI products treat training stages as constraints and product layers as craft — prompts, context, tools, evals, and review — built on top of whatever foundation the lab provides.

The next chapter covers hallucinations, knowledge cutoffs, and model limitations — what happens when the stack’s knowledge is wrong, missing, or out of date, and how PMs design around that reality.

The real PM lesson

Training builds the assistant. Your product decides whether users can trust it.
