Introduction
In Chapter 2, we saw how transformers became Large Language Models.
In Chapter 3, we covered tokens and context windows — the working memory of an AI system.
In Chapter 4, we introduced AI safety, RLHF, and Constitutional AI at a high level — why alignment matters, what human feedback does, and how principled training complements preference learning.
This chapter goes deeper into the work that made modern assistants feel usable: InstructGPT and the RLHF pipeline behind it.
If Chapter 4 answered “Why should PMs care about alignment and feedback?”, Chapter 5 answers “How did OpenAI actually turn GPT-3 into something that follows instructions?”
Many PMs hear “RLHF” as a safety buzzword. That is only half the story. InstructGPT showed that human feedback is also a product behavior layer — the step that converts raw language capability into assistants people want to use.
The simple PM version
Base models predict text.
RLHF-trained models try to do what users actually asked for.
1. The Base Model Problem
A base model (like early GPT-3) is trained to predict the next token on large amounts of text from the internet and books. It learns language, facts, style, and patterns, but not necessarily how to behave as your product’s assistant.
When you give a base model a user instruction, common failure modes include:
| User intent | Base model tendency |
|---|---|
| “Summarize this in 3 bullets.” | Continues the document instead of summarizing |
| “Answer as a claims expert.” | Drifts into generic or theatrical tone |
| “Say you don’t know if evidence is missing.” | Still sounds confident |
| “Follow this JSON schema.” | Produces almost-right structure with subtle errors |
| “Refuse harmful requests.” | Sometimes complies because completion patterns dominate |
| “Be concise.” | Rambles because longer continuations were common in training |
For PMs, the base model problem is a product gap: capability ≠ usable behavior. A model can score well on knowledge benchmarks and still feel broken in a chat UI.
Teams in the pre-ChatGPT era learned this quickly: impressive demos on completion APIs did not automatically translate into reliable workflow tools. Users do not want “the most likely next paragraph.” They want an answer that matches the task.
2. What InstructGPT Fixed
OpenAI’s InstructGPT work (2022) asked a practical question: how do we align language models so they follow instructions, tell the truth more often, and cause less harm — while staying useful?
The headline product outcome was not a bigger model. It was a better-behaved model:
- Follows explicit instructions more reliably,
- Produces answers humans prefer in side-by-side comparisons,
- Reduces toxic or unhelpful outputs,
- Improves truthfulness on some factual prompts,
- Feels more like an assistant and less like autocomplete.
InstructGPT did not replace pre-training. It added training stages on top of a capable base model, using human demonstrations and human rankings to teach how to respond — not just what text is statistically likely.
PM takeaway
The breakthrough was behavioral alignment for real user prompts — turning “language model” into “instruction-following product surface.”
Further reading
OpenAI: Aligning language models to follow instructions — the original InstructGPT announcement and results.
3. The Size Lesson PMs Should Remember
One of the most important InstructGPT findings for product leaders: a smaller aligned model can beat a much larger base model in human preference.
In OpenAI’s evaluations, labelers generally preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the raw 175B-parameter GPT-3 base model on prompts submitted by real API users, even though the smaller model has more than 100x fewer parameters. Bigger was not automatically better for the product experience.
| What teams assume | What InstructGPT showed |
|---|---|
| Largest model wins every UX test | Alignment training can matter more than raw scale |
| “Upgrade to 175B” fixes quality | Behavior tuning may deliver more user-visible lift |
| Cost is only about parameters | Alignment training investment and serving choices also shape unit economics |
| Benchmarks predict satisfaction | Human preference on real prompts is a different metric |
PM takeaway
Before you pay for the biggest model tier, ask whether the product needs more knowledge or better instruction-following behavior. Often the answer is both — but behavior tuning is not optional.
4. RLHF in Simple Terms
RLHF (Reinforcement Learning from Human Feedback) means: use human judgments about which outputs are better, train a scoring model from those judgments, then nudge the language model toward higher-scoring behavior.
At a product level, think of three layers:
- Show the model good examples (supervised fine-tuning on demonstrations).
- Teach a critic what humans prefer (reward model from rankings).
- Practice under that critic (reinforcement fine-tuning, often PPO).
Chapter 4 explained RLHF in the safety frame — helpful, honest, harmless. InstructGPT adds the product frame: RLHF is how you make the model follow the user’s task, not only avoid harm.
Key point
RLHF shapes behavior. It does not primarily teach new facts from scratch.
5. The InstructGPT Pipeline (PM Map)
You do not need to implement PPO to be a strong AI PM. You do need to know the pipeline your vendor (or internal ML team) is approximating.
| Stage | Input | Output | PM translation |
|---|---|---|---|
| Pre-training | Massive text corpora | Base LLM | General capability |
| SFT | Human-written ideal responses | Instruction-tuned model | “Here is how we want answers to look.” |
| Reward model training | Human rankings of multiple answers | Preference scorer | “This answer is better than that one.” |
| RL fine-tuning (PPO) | Prompts + reward signal | Aligned assistant | “Optimize for preferred behavior at scale.” |
| Deployment layers | Policies, filters, tools, evals | Shipped product | “Behavior in production, not only in lab.” |
InstructGPT’s public workflow used real prompts from API customers, human labelers, comparison data, and iterative evaluation. That is why the resulting model felt closer to production reality than a benchmark-only fine-tune.
Further reading
For the research narrative and metrics, see OpenAI’s Aligning language models to follow instructions.
6. Supervised Fine-Tuning (SFT)
Supervised fine-tuning is the teaching phase where humans write high-quality answers to prompts. The model learns to imitate those demonstrations — format, tone, structure, refusal style, and task completion patterns.
What SFT is good at
| Behavior | Why it matters in product |
|---|---|
| Instruction format | Answers the question asked, not a random continuation |
| Role adherence | Stays in “support agent,” “analyst,” or “coach” mode |
| Output structure | Bullets, tables, steps, JSON-like patterns |
| Domain tone | Professional vs casual as required |
| Baseline safety style | How refusals and caveats should read |
What SFT does not guarantee
- Perfect factual accuracy on rare facts,
- Robustness to adversarial prompts,
- Optimal tradeoffs when instructions conflict,
- Behavior on prompts unlike the demonstration set.
For enterprise PMs, SFT is closest to a golden response library — curated prompt/answer pairs that encode how your product should behave in core workflows.
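A golden response library can start as a plain list of prompt/response records serialized as JSON Lines. The sketch below is illustrative only: the field names (`workflow`, `prompt`, `ideal_response`) and the sample records are hypothetical, not any vendor's required fine-tuning format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GoldenExample:
    """One curated demonstration: a prompt and the response we want imitated."""
    workflow: str        # hypothetical label for the product workflow
    prompt: str
    ideal_response: str


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize demonstrations as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def find_duplicate_prompts(examples: list[GoldenExample]) -> set[str]:
    """Return normalized prompts that appear more than once in the set."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for e in examples:
        key = e.prompt.strip().lower()
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


# Hypothetical records for illustration.
golden_set = [
    GoldenExample("claims_summary", "Summarize this claim in 3 bullets.",
                  "- Item one\n- Item two\n- Item three"),
    GoldenExample("policy_qa", "Is flood damage covered?", "Not found in sources."),
]
```

Checking for duplicate prompts before training is a cheap way to catch cases where two reviewers wrote different "ideal" answers to the same request.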
PM takeaway
SFT is “show, don’t tell.” Quality of demonstrations becomes quality of default behavior.
7. Rankings and Preference Data
After SFT, InstructGPT-style training collects comparisons: for the same prompt, labelers rank multiple model outputs from best to worst.
Preference data captures nuances demonstrations miss:
| Comparison dimension | Example |
|---|---|
| Helpfulness | Complete answer vs vague answer |
| Clarity | Structured explanation vs wall of text |
| Truthfulness | States uncertainty vs invents details |
| Safety | Appropriate refusal vs unsafe compliance |
| Instruction match | Follows constraints vs ignores them |
| Verbosity | Right length for the task |
Rankings scale better than rewriting every answer from scratch. They also encode tradeoffs: “Answer B is more helpful but slightly longer — still preferred.”
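Part of the reason rankings scale is arithmetic: one ranking of K outputs expands into K(K-1)/2 pairwise comparisons. A minimal sketch of that expansion, with hypothetical answer labels:

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs: list[str]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first item of every pair
    is always the one the labeler ranked higher.
    """
    return list(combinations(ranked_outputs, 2))


# One labeler ranks four candidate answers from best to worst.
pairs = ranking_to_pairs(["answer_A", "answer_B", "answer_C", "answer_D"])
```

Here a single ranking of four answers yields six training comparisons, which is far cheaper than asking a labeler to write four ideal answers from scratch.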
Product teams can mirror this internally with side-by-side evals: same prompt, two configurations, human or expert judgment on which output wins for the workflow.
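A side-by-side eval usually boils down to a win rate. The sketch below assumes each judgment is recorded as `"A"`, `"B"`, or `"tie"` (a hypothetical encoding) and that ties are excluded from the denominator, which is one common convention rather than the only one.

```python
from collections import Counter


def win_rate(judgments: list[str], candidate: str = "B") -> float:
    """Share of decisive comparisons won by `candidate`; ties are excluded."""
    counts = Counter(judgments)
    decisive = counts["A"] + counts["B"]
    if decisive == 0:
        raise ValueError("no decisive judgments to score")
    return counts[candidate] / decisive


# Hypothetical judgments: configuration B wins 5 of 7 decisive comparisons.
judgments = ["B", "A", "B", "tie", "B", "A", "B", "B"]
rate_b = win_rate(judgments)
```

With small samples like this one, treat the number as directional; a handful of flipped judgments can move it substantially.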
8. Reward Models
A reward model learns to predict which answer humans would prefer, using the comparison data as its training signal. It becomes an automated critic: given a prompt and a candidate answer, it outputs a single score estimating how “good” human labelers would judge that answer to be.
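For readers who want to see what "learning from comparisons" means mechanically, a standard pairwise formulation penalizes the reward model whenever the human-preferred answer does not score higher than the rejected one. The sketch below computes that loss for one pair; the scores are plain floats standing in for a real model's outputs.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_preferred - score_rejected)) stably.

    The loss is small when the preferred answer scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


agrees_with_humans = pairwise_preference_loss(3.0, 0.0)     # small loss
disagrees_with_humans = pairwise_preference_loss(0.0, 3.0)  # large loss
```

Training averages this loss over many human comparisons, so the model's scores gradually come to rank answers the way labelers did.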
PM-friendly properties of reward models:
- They compress thousands of human judgments into a reusable scoring function,
- They enable reinforcement-style training at scale,
- They make behavior objectives more explicit than “vibes from prompt engineering,”
- They can be evaluated — if the reward model is wrong, alignment drifts.
| Risk | Product symptom |
|---|---|
| Reward hacking | Model optimizes for sounding good, not being correct |
| Short prompt bias | Great on demos, weak on long enterprise prompts |
| Labeler mismatch | Behavior fits generic users, not your domain experts |
| Stale preferences | Model feels outdated vs current policy or UI |
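One simple way to evaluate a reward model is to check how often it agrees with human comparisons it was not trained on. The sketch below assumes you already have reward scores for both answers in each held-out pair; the tuple layout is a hypothetical convention.

```python
def agreement_rate(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of held-out pairs where the human-preferred answer scored higher.

    Each tuple is (score_of_human_preferred, score_of_human_rejected).
    Equal scores count as a disagreement.
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    agree = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return agree / len(scored_pairs)


# Hypothetical held-out scores: the critic agrees with humans on 2 of 4 pairs.
held_out = [(2.0, 1.0), (0.5, 0.9), (1.2, -0.3), (0.1, 0.1)]
rate = agreement_rate(held_out)
```

Slicing this rate by workflow or prompt length is a practical way to surface the blind spots listed in the table above.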
PM takeaway
The reward model is a proxy for your product’s definition of “better.” If that definition is vague, alignment will be vague too.
9. PPO Fine-Tuning (What PMs Need to Know)
PPO (Proximal Policy Optimization) is the reinforcement learning algorithm InstructGPT used to update the language model against reward model scores. You do not need the math; you need the product intuition.
During this stage, the model generates answers, the reward model scores them, and training pushes the model toward higher rewards — within guardrails so it does not drift too far from the SFT model (which would break fluency or capabilities).
| Concept | Plain-language meaning |
|---|---|
| Policy | The language model being trained, viewed as a strategy for choosing what to say next |
| Reward signal | Score from the reward model, which was itself trained on human preferences |
| KL penalty / constraint | Stay close to the SFT model’s behavior while chasing higher reward |
| Iteration | Repeated generate → score → adjust cycles |
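The guardrail idea can be made concrete with a little arithmetic: the quantity being optimized is the reward model's score minus a penalty for drifting away from the SFT model. The sketch below is a simplified single-response version; `beta = 0.1` is an illustrative value, and the log-probabilities are made-up numbers standing in for real model outputs.

```python
def kl_penalized_reward(
    reward_score: float,
    policy_logprobs: list[float],
    reference_logprobs: list[float],
    beta: float = 0.1,  # illustrative penalty strength, not a recommended setting
) -> float:
    """Reward-model score minus beta times a sampled KL estimate.

    The estimate sums log p_policy(token) - log p_reference(token) over the
    tokens of one sampled response; the reference is the SFT model.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate


# Same reward-model score, but the second policy has drifted from the SFT model.
no_drift = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
with_drift = kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
```

The drifted response earns less total reward for the same critic score, which is how the training objective discourages the model from chasing reward into unnatural or degenerate text.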
For PMs, PPO is why “aligned” models can feel more polished than SFT-only models — they have been optimized on many prompt variations against a preference objective, not only copied from static examples.
10. RLHF Is Not Just a Safety Layer
Safety teams care about RLHF because it reduces toxicity and improves refusal behavior. Product teams should care because it improves task success rate.
| RLHF effect | Product impact |
|---|---|
| Better instruction following | Fewer “it ignored my request” tickets |
| Better formatting | Easier downstream parsing and UI rendering |
| Better tone control | Brand-consistent customer experiences |
| Better calibration | More appropriate confidence and caveats |
| Better refusals | Less legal/reputational exposure |
| Better user preference match | Higher retention and task completion |
Treating RLHF as “the safety team’s problem” leads to weak PRDs. Treating RLHF as “how we define and optimize preferred behavior” leads to measurable product quality.
11. How RLHF Unlocks Capability Users Actually Feel
Pre-training gives the model knowledge and language skill. RLHF makes that skill callable through natural instructions.
Examples of “unlocked” product behaviors:
- Turning a messy email thread into action items with owners and dates,
- Rewriting technical content for a non-technical executive,
- Following “use only the attached policy PDF” more reliably,
- Producing compare/contrast tables instead of narrative drift,
- Stopping when the user asked for a one-sentence answer.
None of these are magic. They are the result of training on human judgments about what “good assistance” looks like across many prompt types — including prompts from real customers, which InstructGPT emphasized.
PM takeaway
Users experience alignment as usability, not as an ethics whitepaper.
12. The Alignment Tax
OpenAI described an alignment tax: making models more aligned can reduce performance on some capability benchmarks. In the InstructGPT work, mixing pre-training updates into the RL stage reduced these regressions while keeping the alignment gains.
For PMs, “alignment tax” shows up in product tradeoffs:
| Tradeoff | What users may notice |
|---|---|
| More refusals | Safer, but more “I can’t help with that” friction |
| More caution | Less false certainty, but sometimes less decisive tone |
| More generic helpfulness | Polished answers that lack domain edge |
| Stricter style | Consistent brand voice, less creative variation |
| Training cost | Slower iteration cycles for behavior changes |
A mature product team measures both sides: harm rate and task completion, expert approval and latency/cost. The goal is not maximum refusal. The goal is maximum trustworthy usefulness.
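Measuring both sides can be as simple as computing a few rates over the same labeled eval set. The record fields below are hypothetical; the point of the sketch is that refusal rate and harm rate sit next to task completion in one scorecard rather than in separate team dashboards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    task_completed: bool  # did the output accomplish the user's task?
    harmful: bool         # did a reviewer flag the output as harmful?
    refused: bool         # did the model decline to answer?


def scorecard(records: list[EvalRecord]) -> dict[str, float]:
    """Compute usefulness and safety rates over one shared eval set."""
    if not records:
        raise ValueError("scorecard needs at least one record")
    n = len(records)
    return {
        "task_completion_rate": sum(r.task_completed for r in records) / n,
        "harm_rate": sum(r.harmful for r in records) / n,
        "refusal_rate": sum(r.refused for r in records) / n,
    }
```

A release that lowers harm rate but also drops task completion shows up immediately as a tradeoff instead of a one-sided win.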
13. Limitations PMs Should Know
RLHF and InstructGPT-style training are powerful — and incomplete.
| Limitation | Product risk |
|---|---|
| Preference ≠ truth | Fluent, persuasive wrong answers can still win comparisons |
| Labeler distribution | Behavior reflects who labeled the data |
| Coverage gaps | Rare enterprise edge cases under-trained |
| Static training snapshot | Policies and products evolve faster than the model |
| Reward model errors | Systematic blind spots at scale |
| No substitute for tools/RAG | Behavior tuning does not fix missing sources |
Weak PRD line: “We use an RLHF model, so answers will be correct.”
Better requirement: Grounding, eval suites, human review for high-risk flows, telemetry on failure modes, and explicit behavior specs per workflow.
14. Who Are the Labelers?
Alignment data is shaped by the people who write demonstrations and rank outputs. InstructGPT used trained labelers following guidelines — not random crowd votes without context.
PM questions to ask your ML or vendor partner:
- What instructions did labelers follow?
- What cultures, domains, and languages are represented?
- Were labelers generalists or domain experts for your use case?
- How were disagreements resolved?
- How often are guidelines updated?
| Labeling approach | Best for | Watch-out |
|---|---|---|
| General helpfulness labelers | Broad assistant products | May miss niche domain nuance |
| Domain expert reviewers | Healthcare, legal, finance workflows | Higher cost, slower throughput |
| Customer-success guided rubrics | Enterprise support tone and policy | Needs maintenance as policies change |
| Internal PM + SME review loops | Custom fine-tunes for one company | Must avoid overfitting to demo prompts |
PM takeaway
Your product’s “values” show up in labeling guidelines — whether you write them or not.
15. Real Prompts vs Lab Demos
A major InstructGPT design choice was training and evaluating on prompts submitted by real API users, not only synthetic academic prompts.
Why this matters for PMs:
| Eval source | What it optimizes |
|---|---|
| Clean lab prompts | Benchmark-friendly behavior |
| Real production prompts | Messy, ambiguous, multi-intent user behavior |
| Sales demo scripts | Overfitted wow moments |
| Support ticket replay sets | Actual failure modes and phrasing |
If your eval set is only polished demos, you will ship a polished demo machine. Build eval corpora from anonymized production logs (with privacy controls), escalations, and “tickets we lost sleep over.”
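As a rough illustration of a privacy control, the sketch below masks email addresses and phone-like number runs before a logged prompt enters an eval corpus. The two regular expressions are deliberately simple assumptions: they will miss some identifiers and over-redact others (long date or ID strings, for example), so this is a starting point, not a complete PII solution.

```python
import re

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


sample = "Contact jane.doe@example.com or +1 415-555-0100 about claim 42."
cleaned = redact(sample)
```

Short numbers such as the claim ID are left intact, which keeps the prompt realistic for evaluation while removing the most obvious contact details.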
16. Instruction Following as a Product Feature
Instruction following is not a research metric only. It is a feature requirement — as concrete as latency or export format.
Define it per workflow:
| Workflow | Instruction spec example |
|---|---|
| Claims summary | Must use only retrieved claim fields; max 120 words; bullet format |
| Code review assistant | Must list severity-tagged findings; no drive-by refactors |
| Sales email draft | Must follow brand tone guide; no fabricated pricing |
| Policy Q&A | Must cite clause IDs; must say “not found in sources” when missing |
InstructGPT’s lesson: behavior improves when training targets match how users actually instruct the system. Your fine-tune, eval, and prompt templates should target the same instruction distribution.
PM takeaway
Write instruction-following acceptance tests the way you write functional requirements.
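An acceptance test for the claims summary row above might look like the following sketch. It checks only the mechanically verifiable parts of the spec (word limit and bullet format); the "only retrieved claim fields" rule needs a grounding check that is out of scope here. The function name and the accepted bullet markers are assumptions.

```python
def check_claims_summary(output: str, max_words: int = 120) -> list[str]:
    """Return a list of spec violations; an empty list means the output passes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return ["output is empty"]
    violations = []
    if len(output.split()) > max_words:
        violations.append(f"exceeds {max_words} words")
    if not all(line.startswith(("-", "*", "•")) for line in lines):
        violations.append("not every line is a bullet")
    return violations


passing = check_claims_summary("- Claim filed on 3 May\n- Status: approved")
failing = check_claims_summary("The claim was approved.")
```

Returning a list of named violations, rather than a single pass/fail flag, makes it easier to track which part of the spec a model version breaks most often.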
17. Why Prompting Alone Is Not Enough
Prompt engineering is essential — but it competes with billions of parameters trained on internet-scale text. A system prompt can say “be concise,” while the base prior still favors long completions.
| Approach | Strength | Limit |
|---|---|---|
| Prompt-only | Fast to iterate | Fragile under edge cases and adversarial inputs |
| RAG / tools | Improves facts and actions | Does not fully fix tone, refusal, or format discipline |
| SFT / RLHF | Bakes preferred behavior into weights | Expensive; needs governance and evals |
| Runtime guardrails | Catches known failure classes | Can over-block legitimate work |
Modern AI products combine all layers. InstructGPT is strong historical evidence that training-time alignment, not just better prompts on a raw base model, was needed to make chat-based products mainstream.
18. Deployment and Safety Layers
The model weights are not the whole product. Deployment adds monitoring, moderation, rate limits, tool permissions, logging, and incident response.
| Layer | Purpose |
|---|---|
| Aligned model (SFT + RLHF) | Default behavior prior |
| System + developer messages | Session-level policy and task framing |
| RAG / retrieval | Ground answers in approved sources |
| Tool gateways | Enforce what the model can execute |
| Output filters | Block known policy violations |
| Human review queues | Escalate uncertain or high-risk outputs |
| Telemetry + eval regression | Detect drift after model updates |
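The layers in the table can be read as a routing decision applied to every model output. The sketch below is a toy version under stated assumptions: the blocked terms, the `confidence` score, and the 0.7 threshold are all hypothetical placeholders for whatever your filters and scoring actually provide.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block"        # output filter caught a known violation
    ESCALATE = "escalate"  # send to a human review queue
    DELIVER = "deliver"    # show the answer to the user


# Hypothetical markers of content that must never be shown.
BLOCKED_TERMS = ("ssn:", "password:")


def route_output(text: str, confidence: float, high_risk: bool,
                 review_threshold: float = 0.7) -> Action:
    """Apply the output filter first, then the human-review rule."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return Action.BLOCK
    if high_risk or confidence < review_threshold:
        return Action.ESCALATE
    return Action.DELIVER
```

The ordering matters: known violations are blocked before anything else, and high-risk workflows are escalated even when the confidence score looks good.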
Chapter 4 covered red teaming and constitutional principles. Here the PM move is operational: define what happens when the aligned model still fails in production — because it will, eventually, on a novel prompt.
19. Enterprise Product Implications
Enterprise PMs rarely train frontier models from scratch. You still inherit InstructGPT’s lessons when selecting vendors, designing fine-tunes, or shipping copilots.
Vendor diligence checklist
- Is the model instruction-tuned or primarily a base/completion model?
- What human feedback pipeline exists (SFT, rankings, RL)?
- How are refusals, PII, and policy violations evaluated?
- Can you bring private preference data for domain alignment?
- What regression tests run before model version bumps?
Build vs buy behavior
| Strategy | When it fits |
|---|---|
| Use vendor-aligned model only | General assistant, low domain risk |
| SFT on internal golden sets | Repeated structured workflows |
| Preference eval + prompt/RAG tuning | Rapid iteration before full fine-tune |
| Custom reward objectives | Mature ML platform, high volume, clear rubric |
Also plan for versioning: when the vendor ships a new aligned snapshot, your refusal rates, citation behavior, and JSON reliability can shift. Treat model upgrades like API breaking-change reviews.
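A model upgrade review can be backed by a small regression gate that compares eval metrics between the current and candidate versions. In this sketch the metric names, values, and the 0.02 tolerance are hypothetical, and every metric is assumed to be "higher is better"; a metric like refusal rate would need its own direction and threshold.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, with the size of the drop."""
    regressions = {}
    for name, old_value in baseline.items():
        if name not in candidate:
            raise KeyError(f"candidate is missing metric: {name}")
        drop = old_value - candidate[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical eval results for the current and the new model snapshot.
current = {"json_valid_rate": 0.98, "citation_rate": 0.90, "task_completion": 0.85}
upgrade = {"json_valid_rate": 0.93, "citation_rate": 0.91, "task_completion": 0.84}
blocked_by = find_regressions(current, upgrade)
```

Here only the JSON validity drop exceeds the tolerance, so that single metric would block or at least flag the upgrade for review.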
PM takeaway
Enterprise trust comes from behavior specs + evals + controls — aligned weights are the foundation, not the finish line.
20. The PM Mental Model
InstructGPT is the bridge story between “large language model” and “assistant product.”
RLHF is how you encode what “better” means — then optimize for it.
| Stage | PM question to ask |
|---|---|
| Base model | Do we have enough raw capability for this use case? |
| SFT | Do we have golden answers that define great behavior? |
| Rankings | Can experts or users compare outputs reliably? |
| Reward model | Does our scorer match business definitions of quality? |
| RL fine-tuning | Are we optimizing the right behaviors at scale? |
| Deployment | What happens when the model is wrong in production? |
Chapter 4 taught that alignment is a product architecture concern. Chapter 5 shows the historical mechanism: human demonstrations, human preferences, reward models, and reinforcement fine-tuning — the stack behind instruction-following assistants.
The real PM version
Users do not experience parameters. They experience whether the system did the task — safely, clearly, and on instruction.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Base model problem | Capability without instruction-following is not a shippable assistant |
| InstructGPT | OpenAI’s alignment training recipe: demonstrations + rankings + RL |
| Size lesson | Smaller aligned models can beat larger base models in human preference |
| RLHF | Train preferred behavior using human feedback at scale |
| SFT | Teach the model ideal responses via demonstrations |
| Preference data | Rankings encode tradeoffs between candidate answers |
| Reward model | Automated critic learned from human comparisons |
| PPO | Reinforcement fine-tuning that optimizes against the reward model |
| Not just safety | RLHF improves usability, format, tone, and task completion |
| Alignment tax | Alignment gains can trade off with some capabilities unless managed |
| Labelers | Behavior reflects labeling guidelines and reviewer pool |
| Real prompts | Production-like eval data beats demo-only testing |
| Deployment | Aligned weights plus product guardrails earn trust |
Closing Thought
Before InstructGPT, many teams treated the base model as the product. After InstructGPT, the industry learned that the product is the stack: capability, demonstrations, preferences, optimization, and deployment discipline.
If you are a PM shipping an AI feature today, you are downstream of this story — even if you never train a reward model yourself. Your specs, eval rubrics, golden data, and workflow guardrails are how alignment shows up in your product.
The next step in this module is controlling how the model generates each token — temperature, top-p, and sampling — the knobs that shape creativity versus consistency in live products.
The real PM lesson
Instruction-following is trained, measured, and shipped — not wished into existence with a longer system prompt.