Pre-training vs Fine-tuning vs RLHF — The PM Version

Introduction

In Chapter 5, we walked through InstructGPT and the RLHF pipeline — how human feedback turns a base model into an instruction-following assistant.

In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the product layers you control after a vendor ships a model.

In Chapter 8, we covered hallucination — why fluent text is not proof, and how grounding and verification reduce risk.

In Chapter 9, we covered long context — capacity vs focus, and why bigger windows do not replace context design.

Now we zoom out to the training stack that happens before your product ever sends a prompt:

Pre-training, fine-tuning, and RLHF — what each stage actually changes.

PMs hear these terms in roadmap reviews, vendor decks, and compliance conversations. They are often lumped together as “the model.” In practice they solve different problems: general capability, specialized behavior, and human-preferred behavior.

This chapter is not a machine learning course. It is a map of what you are buying, what you can still change in product, and what mistakes to avoid when you scope AI features.

The simple PM version

Pre-training teaches language and broad knowledge.
Fine-tuning teaches patterns and formats.
RLHF teaches what humans prefer.
Your product still owns prompts, context, tools, and verification.

1. The Simple Comparison: Three Training Layers

Think of a modern assistant as a stack. Each training layer is expensive and slow to change, and the layers are often owned by different teams.

Stage	One-line PM meaning	Who usually owns it
Pre-training	Learn language and world patterns from huge text	Foundation model lab / vendor
Fine-tuning	Specialize behavior on a smaller, curated dataset	Vendor or your ML platform team
RLHF	Optimize for human-preferred answers at scale	Vendor alignment team
Product layers	Prompt, RAG, tools, policies, UI, human review	Your product + engineering

When a user says “the model is wrong,” ask which layer failed: missing knowledge (pre-training + retrieval), wrong format (fine-tuning or prompting), unhelpful tone (RLHF or prompting), or missing live data (tools, not training).
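The triage question above can be sketched as a small lookup. This is only an illustration: the symptom labels and the `layers_to_check` helper are hypothetical names, not a standard taxonomy.

```python
# Hypothetical triage map: symptom label -> layers to inspect first.
# Labels and layer names mirror the paragraph above and are illustrative only.
TRIAGE = {
    "missing_knowledge": ["pre-training", "retrieval"],
    "wrong_format": ["fine-tuning", "prompting"],
    "unhelpful_tone": ["RLHF", "prompting"],
    "missing_live_data": ["tools"],
}


def layers_to_check(symptom: str) -> list[str]:
    """Return the layers to inspect for a symptom, or flag it for triage."""
    return TRIAGE.get(symptom, ["needs triage"])
```

Tagging incident tickets with a symptom like this makes it easier to see which layer generates the most complaints.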

PM takeaway

Training stages set the ceiling. Product layers determine what users actually experience in production.

2. Pre-training: Building the Base Model

Pre-training is the first large training pass. The model reads enormous amounts of text (web, books, code, etc.) and learns to predict the next token. That objective teaches grammar, reasoning patterns, style, and a wide slice of factual associations.
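To make "predict the next token" concrete, here is a toy sketch that counts which word follows which in a tiny corpus. Real pre-training uses neural networks over trillions of tokens; the corpus and function names here are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count which token follows which: a toy stand-in for pre-training."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict, token: str):
    """Return the most frequent next token seen in training, or None if unseen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]


# Illustrative corpus: "approved" follows "was" twice, "denied" once.
corpus = "the claim was approved . the claim was denied . the claim was approved ."
model = train_bigram(corpus)
```

Here `predict_next(model, "was")` returns `"approved"` because that continuation was most frequent, not because it is true for any particular claim. The same gap between likely text and verified fact exists at full scale.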

The output is often called a base model or foundation model. It is powerful but not yet tuned for your chat UI, brand, or workflow.

PM analogy

Pre-training is like hiring a brilliant generalist who has read the internet — but has never worked in your company, never used your forms, and does not know your escalation rules.

Property	Typical scale
Data	Trillions of tokens
Compute	Very large clusters, weeks to months
Cost	Tens of millions of dollars or more for frontier models
PM access	Usually none — you pick a vendor model

Most product teams do not pre-train. They choose GPT, Claude, Gemini, Llama, or an enterprise-hosted variant. Your leverage is model selection, version pinning, and everything above the base weights.

3. What Pre-training Is Good For

Pre-training establishes the raw capabilities your product inherits:

broad language fluency across domains,
general reasoning and explanation ability,
coding, math, and structured text patterns,
multilingual coverage (varies by model),
world knowledge frozen at a training cutoff date.

Product need	Pre-training contribution
Draft a clear paragraph	Strong — core language skill
Explain insurance concepts in plain English	Strong — if concepts were in training data
Know today’s claim status	Weak — needs tools or RAG
Follow your JSON schema every time	Variable — often needs fine-tuning or strict prompting

PM takeaway

Pre-training is why the model feels “smart.” It is not why the model feels “on brand” or “connected to your systems.”

4. Pre-training Limits PMs Should Plan For

Base models come with structural limits that later stages only partially fix:

Limit	What users experience	Product response
Knowledge cutoff	Outdated facts, wrong “current” events	RAG, tools, disclaimers — see Chapter 11
No private data	Cannot know your member’s claim by default	Retrieval + authenticated APIs
Completion bias	Keeps writing instead of answering	Instruction tuning (RLHF/SFT) + prompts
Hallucination risk	Fluent guesses when uncertain	Grounding, citations, human review — Chapter 8
Unsafe completions	Harmful or non-compliant text	RLHF, filters, policy layers — Chapters 4–5

Upgrading from “8B base” to “70B base” can help quality, but it does not replace product architecture. Bigger pre-trained models still need alignment, context, and verification for enterprise workflows.

5. Fine-tuning: Specializing Behavior

Fine-tuning continues training on a smaller, curated dataset so the model adapts to a task, domain tone, or output structure. Weights change — but far less than in pre-training.

Common types product teams encounter:

Supervised fine-tuning (SFT) — learn from example input/output pairs,
Domain fine-tunes — legal, medical, code, enterprise vocabulary,
Adapter / LoRA fine-tunes — cheaper partial updates on top of a frozen base,
Distillation — train a smaller model to mimic a larger one (cost/latency play).
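Why adapter / LoRA fine-tunes are cheaper comes down to arithmetic: instead of updating a full weight matrix, LoRA trains two thin matrices whose product approximates the update. The sketch below counts trainable parameters for one layer; the layer shape and rank are hypothetical values chosen for illustration.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8        # hypothetical layer shape and rank
full = full_update_params(d_out, d_in)   # 16_777_216
lora = lora_params(d_out, d_in, rank)    # 65_536
ratio = lora / full                      # 0.00390625, about 0.4% of the full update
```

The frozen base stays shared, which is why a vendor or platform team can host several small adapters on top of a single copy of the base model.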

Chapter 6 focused on when your team should fine-tune vs prompt. This chapter focuses on what fine-tuning is in the vendor stack — and how it differs from RLHF.

Key distinction

Fine-tuning teaches patterns from examples. It is not a reliable way to inject fast-changing facts at scale.

6. What Fine-tuning Is Good For

Use case	Why fine-tuning helps
Stable output schema	Model internalizes JSON / field layout
Consistent adjudication wording	Repeats approved phrasing patterns
Domain classification	Learns labels from historical examples
Brand voice on short replies	Tone becomes default, not re-prompted every call
Lower latency via smaller specialist model	Distilled or fine-tuned mini model for one task

Fine-tuning shines when the task is repeated, measurable, and you have high-quality labeled data. If SMEs can write 500 excellent examples and evals are clear, fine-tuning may beat a giant prompt.
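Before any training run, the example set itself deserves a check. The sketch below assumes examples are stored as JSON Lines with hypothetical `input` and `output` fields, drops incomplete rows, and holds out an eval slice that the fine-tune never sees.

```python
import json
import random


def load_sft_examples(jsonl_text: str) -> list[dict]:
    """Parse JSONL and keep only rows with non-empty 'input' and 'output' fields."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        if row.get("input") and row.get("output"):
            rows.append(row)
    return rows


def split_train_eval(rows: list, eval_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and hold out a slice for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Keeping the eval slice separate is what lets you spot overfitting: a model that scores well only on examples it trained on has not learned the task.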

PM takeaway

Fine-tune for behavior and format stability — not as a substitute for a knowledge base or live APIs.

7. Fine-tuning Limits and Risks

Risk	What goes wrong
Stale knowledge baked in	Model confidently cites old policy version
Small bad dataset	Learns wrong shortcuts (“always deny”)
Overfitting	Great on eval set, brittle in production
Regression on general tasks	Specialist becomes worse outside niche
Operational cost	Retrain cycles, versioning, monitoring overhead
Compliance drift	Examples disagree with current legal text

Chapter 8 called out a common hallucination driver: using fine-tuning to add facts. Facts belong in retrieval or systems of record; fine-tuning should teach how to use evidence, not replace evidence.

If policy changes monthly, prefer RAG + citation UI over retraining monthly — unless you have a rigorous MLOps pipeline and legal sign-off on each release.

8. RLHF: Learning What Humans Prefer

RLHF (Reinforcement Learning from Human Feedback) nudges the model toward responses humans rank higher. It typically follows demonstration fine-tuning, in which humans write ideal answers for the model to imitate. Humans then rank alternative model outputs, a reward model learns those preferences, and reinforcement fine-tuning (often PPO) optimizes the model against that reward.
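The reward-model step can be illustrated with the pairwise preference loss commonly used for it: the loss is small when the answer humans preferred gets the higher score. This sketch covers only that one step, not the PPO stage, and the function name is an assumption.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) for one ranked pair."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is the same quantity written in a simpler form.
    return math.log1p(math.exp(-margin))
```

Note what the loss rewards: agreement with human rankings. Nothing in it checks whether the preferred answer is factually correct.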

Chapter 5 mapped the InstructGPT pipeline in detail. Here is the PM-level purpose: make the model act like an assistant users want to talk to — helpful, clear, appropriately cautious, and aligned with policy.

RLHF optimizes for	RLHF does not primarily add
Instruction following	Your proprietary database records
Preferred tone and structure	Guaranteed factual correctness
Reduced toxicity / policy violations	Real-time inventory counts
Refusal behavior when appropriate	Perfect recall of 200-page PDFs without context

9. What RLHF Is Good For

Chat UX that feels responsive to user intent,
fewer blatantly unhelpful or rambling answers,
better refusals on unsafe or out-of-scope requests,
answers that win side-by-side preference tests,
behavior that matches vendor safety policies.

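InstructGPT’s lesson for PMs still applies: a smaller RLHF-tuned model can beat a much larger raw base model on real user prompts. In the InstructGPT paper, labelers preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3. Alignment training can matter more than parameter count for perceived quality.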

PM takeaway

When users say “ChatGPT feels better than the raw API,” they are often describing RLHF + product layers, not pre-training alone.

10. RLHF Limits PMs Should Not Ignore

Limit	Product symptom
Helpfulness vs truth tension	Model answers instead of saying “I don’t know”
Preference bias	Verbose, generic answers that score well with raters
Over-refusal	Blocks legitimate business tasks
Opaque tradeoffs	Hard to predict change across model versions
Not your preferences	Vendor alignment ≠ your brand or jurisdiction

RLHF reduces some failure modes; it does not remove hallucination. Pair vendor alignment with your evals, grounding, and human review — especially in regulated workflows.

See Chapter 4 for safety framing and Chapter 5 for pipeline detail.

11. Side-by-Side: Pre-training vs Fine-tuning vs RLHF

Dimension	Pre-training	Fine-tuning	RLHF
Primary goal	General capability	Specialized behavior	Human-preferred behavior
Data scale	Massive, noisy web-scale	Smaller, curated	Rankings + demonstrations
Typical owner	Foundation lab	Lab or enterprise ML	Lab alignment team
Changes weights?	Full model from scratch	Partial or full update	Further update on top
Best for facts?	Broad, dated associations	Risky for volatile facts	Not designed for facts
Best for tone?	Imitates internet average	Can encode brand patterns	Optimizes “sounds helpful”
PM iteration speed	Slow (vendor releases)	Medium (retrain cycles)	Slow (vendor releases)

Stack order: Pre-train → Fine-tune (often SFT) → RLHF → Your product (prompt, RAG, tools, UI).

12. Where Prompting Fits (After Training)

Prompting does not change model weights. It is still one of the highest-leverage layers for product teams — especially because you can ship prompt changes in hours, not months.

Problem	Try first	Training-stage note
Need today’s policy clause	RAG + citation	Pre-training cutoff is irrelevant if retrieval is right
Need strict JSON every time	Schema + prompt; then fine-tune if still flaky	Fine-tuning helps when prompts max out
Model too verbose	Prompt + sampling settings	RLHF may already bias toward long answers
Wrong tool choice in agents	Tool design + context engineering	Chapter 9 — not solved by bigger base model
Enterprise tone	System prompt; fine-tune if stable across thousands of calls	RLHF sets vendor default, not yours

Chapter 6 is the deep dive on prompting vs fine-tuning vs RAG vs tools. This chapter’s job is to clarify what the vendor already baked in before your prompt runs.
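The "strict JSON" row in the table above is a good example of a product-layer fix that needs no training. A minimal sketch, assuming a hypothetical two-field schema and a `generate` callable that stands in for whatever model client your team uses:

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"decision": str, "reason": str}


def parse_model_json(raw: str):
    """Return the parsed dict if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def call_with_retry(generate, prompt: str, max_attempts: int = 3):
    """Call the model until its output validates; also return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        parsed = parse_model_json(generate(prompt))
        if parsed is not None:
            return parsed, attempt
    return None, max_attempts
```

Logging the attempt count gives you the evidence for the second half of that row: if validation still fails often after prompt and schema work, that is the signal to consider fine-tuning.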

13. Practical Example: Claims Adjudication Assistant

Pre-training layer: Model understands medical terms, insurance language, and how to summarize documents.

Fine-tuning layer (optional): Internal fine-tune on historical adjudication notes so JSON shortfall codes and phrasing match your operations manual.

RLHF layer (vendor): Model follows instructions, refuses obvious harm, sounds helpful in chat.

Product layers (you):

RAG for current policy PDF and checklist,
tool calls for claim status and member eligibility,
system prompt for escalation and “no final denial without human,”
low-temperature extraction subtasks,
human review gate before customer-facing output.
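Two of the product layers listed above can be sketched in a few lines: assembling runtime context from retrieval and tool output, and gating drafts for human review. The `Draft` fields, the gating rule, and the prompt layout are assumptions for illustration; a real gate would follow your own compliance policy.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    decision: str          # e.g. "approve" or "deny"
    cited_sections: list   # policy sections the draft relies on


def needs_human_review(draft: Draft) -> bool:
    """Illustrative rule: review every denial and every draft without citations."""
    return draft.decision == "deny" or not draft.cited_sections


def build_prompt(system_rules: str, policy_chunks: list, claim_status: str, question: str) -> str:
    """Assemble runtime context: rules, retrieved policy text, tool output, question."""
    policy = "\n".join(policy_chunks)
    return (
        f"{system_rules}\n\n"
        f"Policy excerpts:\n{policy}\n\n"
        f"Claim status: {claim_status}\n\n"
        f"Question: {question}"
    )
```

None of this touches model weights, so a policy change means updating the retrieved documents, not retraining.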

PM takeaway

Do not fine-tune the model on policy PDFs if RAG plus citation is safer. Fine-tune on behavior when format and tone must be identical every time.

14. Practical Example: Customer Support Copilot

Pre-training layer: Drafts empathetic replies, understands product categories, multilingual support (model-dependent).

Fine-tuning layer (optional): Train on redacted ticket history so replies match macros, tagging, and resolution steps.

RLHF layer (vendor): Polite, instruction-following chat behavior; baseline safety refusals.

Product layers (you):

retrieve help-center articles and account facts at answer time,
inject CRM context in the prompt,
force citations for refund policy claims,
agent UI with edit-before-send,
eval suite on real ticket types weekly.
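"Force citations for refund policy claims" can be enforced with a check that runs before the agent sees the draft. This sketch assumes a hypothetical `[source: ...]` tag format and a simple keyword match; a production check would likely be more robust.

```python
import re

# Hypothetical citation format, e.g. "[source: refund-policy]".
CITATION = re.compile(r"\[source:\s*[^\]]+\]")


def refund_claims_are_cited(reply: str) -> bool:
    """Every sentence that mentions refunds must carry a [source: ...] tag."""
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    for sentence in sentences:
        if "refund" in sentence.lower() and not CITATION.search(sentence):
            return False
    return True
```

A failed check can send the draft back for regeneration or flag it in the edit-before-send UI.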

Support PMs often ask for “fine-tune on all tickets.” Engineering should ask: is the gap knowledge (retrieval), behavior (fine-tune), or preference (prompt + RLHF already there)?

15. Decision Framework: Five Questions for Roadmap Reviews

What failure are we seeing? — Knowledge, format, tone, safety, or missing live data.
Is it repeated at scale? — One-off fixes belong in prompts; repeated patterns may justify fine-tuning.
Do we own the data? — Fine-tuning and evals need rights-cleared, representative examples.
How fast does truth change? — Volatile facts → RAG/tools; stable behavior → fine-tune.
What does the vendor already provide? — New base + RLHF release may solve it without custom training.

If the gap is…	Lean toward…
Wrong or missing facts	RAG, tools, verification (Ch 8, 11)
Wrong JSON / labels every time	Prompt + schema; then fine-tune
Rude or off-policy chat	Prompt + vendor model upgrade; rarely custom RLHF
Too expensive at volume	Smaller model, distillation, task routing
Context overload in agents	Context design (Ch 9), not bigger pre-train
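The table above can be read as a decision function. The sketch below is one possible encoding, with hypothetical gap labels; it also folds in two of the five questions (scale and how fast truth changes) for the format case.

```python
def recommend_lever(gap: str, facts_change_often: bool = False, repeated_at_scale: bool = False) -> str:
    """Map a diagnosed gap to a first lever, following the table above."""
    if gap == "facts":
        return "RAG, tools, verification"
    if gap == "format":
        # Prompt and schema first; fine-tune only for stable, repeated patterns.
        if repeated_at_scale and not facts_change_often:
            return "prompt + schema, then fine-tune"
        return "prompt + schema"
    if gap == "tone":
        return "prompt + vendor model upgrade"
    if gap == "cost":
        return "smaller model, distillation, task routing"
    if gap == "context_overload":
        return "context design"
    return "diagnose the failure first"
```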

16. Common PM Mistakes

Mistake	Why it fails
“We’ll fine-tune on the policy wiki”	Stale facts; hallucination risk
“RLHF means the model is truthful”	Preference ≠ verification
“Bigger pre-train fixes our agent”	Context and tools may be the bottleneck
“Prompting is temporary, fine-tune is real”	Prompts are permanent product surface
“We need custom RLHF”	Extremely expensive; vendor defaults first
No version pin on model family	Surprise behavior shifts after vendor update
Same strategy for MVP and compliance tier	High-risk flows need stricter grounding

17. PRD Layer Map: What to Specify

When you write requirements, name the layer explicitly so engineering and legal align:

PRD section	Specify
Model baseline	Vendor, model ID, version, knowledge cutoff
Training assumptions	“Uses vendor RLHF chat model” vs “custom fine-tune v3”
Knowledge	RAG sources, refresh SLA, citation required Y/N
Behavior	Output schema, refusal rules, escalation paths
Evals	Golden sets per layer — not one vague “accuracy” metric
Change management	What happens when base model or fine-tune version bumps
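One way to keep these sections from being skipped is to treat the layer map as a structured record. The sketch below mirrors the table rows as fields; the `LayerSpec` class and `missing_sections` helper are hypothetical, not part of any PRD tool.

```python
from dataclasses import dataclass, fields


@dataclass
class LayerSpec:
    model_baseline: str        # vendor, model ID, version, knowledge cutoff
    training_assumptions: str  # vendor chat model vs custom fine-tune
    knowledge: str             # RAG sources, refresh SLA, citation rule
    behavior: str              # output schema, refusal rules, escalation paths
    evals: str                 # golden sets per layer
    change_management: str     # plan for base model or fine-tune version bumps


def missing_sections(spec: LayerSpec) -> list:
    """List PRD sections left blank so reviewers can hold sign-off."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
```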

PM takeaway

Ambiguous “improve the model” tickets become expensive. Split work by layer: retrieval bug vs prompt vs fine-tune retrain.
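Splitting work by layer is easier when eval results are reported by layer too. A minimal sketch, assuming each golden-set case is tagged with the layer it exercises (the tags shown in the usage note are hypothetical):

```python
from collections import defaultdict


def pass_rate_by_layer(results) -> dict:
    """results: iterable of (layer, passed) pairs from golden-set runs."""
    totals = defaultdict(lambda: [0, 0])  # layer -> [passed, total]
    for layer, passed in results:
        totals[layer][0] += int(passed)
        totals[layer][1] += 1
    return {layer: p / t for layer, (p, t) in totals.items()}
```

For example, `pass_rate_by_layer([("retrieval", True), ("retrieval", False), ("format", True)])` returns `{"retrieval": 0.5, "format": 1.0}`, which points the next sprint at retrieval instead of a vague "improve the model" ticket.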

18. The PM Mental Model

Pre-training gives capability.
Fine-tuning gives specialization.
RLHF gives preferred assistant behavior.
Your product gives truth in context.

Pre-train → Fine-tune → RLHF → Product

Layer	Your main PM question
Pre-training	Which foundation model and cutoff fit our domain?
Fine-tuning	Do we have stable examples worth baking into weights?
RLHF	Does the vendor chat model meet safety and UX bar?
Prompt + context	What must the model see for this request? (Ch 9)
RAG + tools	Where do fresh facts live? (Ch 6, 8)
Verification	What requires human or automated check before ship?

You rarely control pre-training or RLHF. You always control what happens after the model loads.

Chapter Summary

Concept	PM understanding
Pre-training	Builds general language and knowledge; vendor-scale
Fine-tuning	Specializes behavior; risky for volatile facts
RLHF	Aligns to human-preferred assistant behavior
Stack order	Pre-train → fine-tune → RLHF → product layers
Prompting	Runtime lever; complements all training stages
Claims / support patterns	RAG + tools for facts; fine-tune for stable format
PM role	Name the layer in PRDs, evals, and incident reviews

Closing Thought

Vendor slides blur the stack into one word: “GPT-4” or “Claude.” Your job is to unblur it for the team. When quality shifts after a model upgrade, ask whether pre-training, alignment, or your retrieval layer moved.

The organizations that ship reliable AI products treat training stages as constraints and product layers as craft — prompts, context, tools, evals, and review — built on top of whatever foundation the lab provides.

The next chapter covers hallucinations, knowledge cutoffs, and model limitations — what happens when the stack’s knowledge is wrong, missing, or out of date, and how PMs design around that reality.

The real PM lesson

Training builds the assistant. Your product decides whether users can trust it.
