Introduction
In Chapter 2, we saw how transformers became Large Language Models.
In Chapter 3, we covered tokens and context windows — the working memory of an AI system.
In Chapter 4, we introduced AI safety, RLHF, and Constitutional AI — why alignment matters and how human feedback shapes behavior.
In Chapter 5, we went deeper into InstructGPT and the RLHF pipeline that turned base models into instruction-following assistants.
In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the layers that shape what the model knows and how it behaves.
Now we turn to a question that confuses users, frustrates QA teams, and surprises many product managers:
Why does the same prompt sometimes produce different answers?
The answer is not randomness for its own sake. It is sampling — the process by which a language model turns probability scores into the next token, one token at a time. Temperature, top-p, and related controls determine how adventurous or conservative that process is.
This matters for product managers because sampling settings directly affect reliability, regression risk, user trust, evaluation design, structured output stability, and whether your feature feels “consistent” or “creative.” A support bot and a brainstorming assistant should not use the same generation settings — even if they share the same base model.
This chapter explains temperature, top-p, and sampling from a PM lens — not as a math lecture, but as a practical product control framework.
The simple PM version
Low temperature and tight top-p for reliability.
Higher temperature for variation and creativity.
Match sampling to the product job — not to what feels interesting in a demo.
1. Why the Same Prompt Gives Different Answers
A user submits the same question twice. The model gives two different answers. A PM runs the same eval prompt ten times. Three runs pass; seven fail. A demo looks great on Monday and flaky on Tuesday.
This is normal behavior for most LLM products — unless you deliberately configure the system for repeatability.
Common reasons outputs vary
| Cause | What it means for PMs |
|---|---|
| Sampling is enabled | The model randomly picks among likely tokens |
| Temperature is above zero | Less likely tokens get a real chance |
| Top-p allows multiple paths | Output can branch early and diverge |
| No seed or seed not supported | Runs are not guaranteed to repeat |
| Model or API version changed | Probabilities shifted under the hood |
| Context changed slightly | Even small input differences matter |
| Truncation | A max-token limit or stop sequence cuts the output at a different point |
Users often interpret variation as “the AI is confused.” Engineers often interpret it as “expected sampling behavior.” Product managers need a third view: Is this variation acceptable for this feature?
PM takeaway
Different answers are not always a bug. Sometimes they are a product choice. Your job is to decide whether a feature should be repeatable, exploratory, or both — and configure sampling accordingly.
2. From Probabilities to Picks: How Generation Works
At each step, the model does not “know” the full answer upfront. It predicts a probability distribution over possible next tokens, then picks one. That token is appended to the output, and the process repeats until the model emits an end-of-sequence token or hits a limit such as max tokens or a stop sequence.
A simplified view of one generation step:
| Step | What happens |
|---|---|
| 1 | Model reads prompt + generated text so far |
| 2 | Model scores every candidate next token |
| 3 | Scores become probabilities |
| 4 | Sampling rules filter or reshape those probabilities |
| 5 | One token is selected |
| 6 | Loop continues until stop condition |
This is why small early differences compound. If the model chooses “However” instead of “Therefore” on token three, the rest of the answer may follow a different path.
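The loop described above can be sketched in a few lines. This is a toy illustration only: `TOY_MODEL` is a hypothetical lookup table invented for this example, and it conditions on just the last token, whereas a real LLM reads the whole context and scores tens of thousands of candidates at every step.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"claim": 0.7, "policy": 0.3},
    "A": {"claim": 0.5, "policy": 0.5},
    "claim": {"is": 0.8, "was": 0.2},
    "policy": {"is": 0.6, "was": 0.4},
    "is": {"approved": 0.5, "pending": 0.5},
    "was": {"approved": 0.5, "denied": 0.5},
    "approved": {"<end>": 1.0},
    "pending": {"<end>": 1.0},
    "denied": {"<end>": 1.0},
}

def generate(rng, max_tokens=10):
    """Run the read -> score -> sample -> append loop until a stop condition."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = TOY_MODEL[tokens[-1]]            # steps 1-3: read context, get probabilities
        candidates = list(probs)
        weights = list(probs.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]  # steps 4-5: pick one
        if next_token == "<end>":                # step 6: stop condition
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

# Same seed replays the same path; different seeds can branch early and diverge.
same_seed_a = generate(random.Random(7))
same_seed_b = generate(random.Random(7))
outputs = {generate(random.Random(seed)) for seed in range(50)}

print(same_seed_a)
print(len(outputs), "distinct outputs across 50 seeds")
```

Because each pick feeds the next step, one different choice near the start is enough to send the rest of the sentence down another branch.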
PM analogy
Think of autocomplete on your phone, but at every word the keyboard ranks many options and sometimes picks a less obvious one on purpose. That “on purpose” part is sampling policy — and it is configurable.
3. What Is Sampling?
Sampling is the rule for choosing the next token from a probability distribution. With no sampling at all (greedy decoding), a model always picks the single highest-probability token: predictable, but often repetitive and dull. With sampling, the model can explore plausible alternatives.
| Term | Plain-language meaning |
|---|---|
| Probability distribution | Ranked list of likely next tokens |
| Sampling | How one token is chosen from that list |
| Deterministic decoding | Always pick the top choice |
| Stochastic decoding | Allow randomness among likely choices |
| Seed | Fixed starting point for randomness, when supported |
Sampling is not the same as intelligence. It does not make the model smarter. It changes how the model expresses uncertainty — more conservatively or more creatively.
For PMs, the key question is:
Should this feature explore alternatives, or should it converge on one reliable answer?
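To make the terms in the table concrete, here is a minimal sketch contrasting deterministic and stochastic picks, plus the effect of a seed. The probabilities are invented for illustration, and the seed here is a local Python random seed, which is only an analogy for the provider-side seed parameter some APIs expose.

```python
import random

# Hypothetical next-token probabilities after the prompt "The refund was"
next_token_probs = {"approved": 0.55, "processed": 0.30, "denied": 0.10, "purple": 0.05}

def pick_greedy(probs):
    """Deterministic decoding: always return the highest-probability token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Stochastic decoding: draw one token in proportion to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Greedy gives the same pick every time.
greedy_picks = [pick_greedy(next_token_probs) for _ in range(5)]

# Sampling varies across draws, but a fixed seed replays the same sequence of picks.
rng = random.Random(42)
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(200)]

rng_again = random.Random(42)
replayed_picks = [pick_sampled(next_token_probs, rng_again) for _ in range(200)]

print(greedy_picks)
print(sorted(set(sampled_picks)))
print(sampled_picks == replayed_picks)
```

The design point for PMs: a seed makes randomness repeatable, it does not remove it. Change the seed and the sampled path can change too.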
4. Greedy Decoding and Deterministic Outputs
Greedy decoding means always selecting the highest-probability token at each step. It is the most deterministic common strategy, though in practice repeatability still depends on the model version, the serving infrastructure (for example, batching and floating-point behavior on GPUs), and any other source of randomness in the pipeline.
| Trait | Effect |
|---|---|
| Most predictable path | Same prompt often yields same output |
| Low creativity | Repetitive phrasing is more likely |
| Good for structured tasks | Classification, extraction, JSON-like outputs |
| Not a guarantee of correctness | Top choice can still be wrong |
In practice, many API products achieve near-greedy behavior with temperature = 0 (or the provider’s equivalent “deterministic” setting). Some providers also expose an explicit seed parameter for stronger repeatability.
PM takeaway
If your feature breaks when wording changes slightly — policy Q&A, form filling, routing labels, compliance summaries — start with low or zero temperature. Do not assume users want creative variation in operational workflows.
5. Temperature: The Creativity Knob
Temperature rescales the model’s raw token scores (logits) before they are turned into probabilities and sampled. Low temperature sharpens the distribution, so top tokens dominate. High temperature flattens it, so lower-ranked tokens become more likely.
| Temperature | Typical behavior | Product feel |
|---|---|---|
| 0 | Most deterministic | Reliable, repetitive |
| 0.1–0.3 | Mostly conservative | Stable assistant |
| 0.5–0.7 | Balanced variation | General-purpose chat |
| 0.8–1.0 | More diverse wording | Brainstorming, drafting |
| > 1.0 | High randomness | Experimental, risky for production |
Temperature does not change what the model knows. It changes how willing the system is to choose less obvious continuations.
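For readers who want to see the mechanics, the standard formulation divides each raw score by the temperature T before the softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T). The sketch below applies that formula to four made-up scores. The scores are assumptions for illustration, and because the formula divides by T, the function rejects 0; serving stacks commonly treat temperature 0 as a special case meaning greedy selection.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing each score by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy (argmax) instead")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw scores for four candidate tokens, best first
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as temperature rises while the ranking of tokens stays the same. That is the sense in which temperature changes willingness, not knowledge.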
Example
Prompt: “Suggest three subject lines for a renewal email.” At temperature 0.2, subject lines may be safe and similar across runs. At temperature 0.9, subject lines may vary widely — useful for ideation, risky if brand tone must stay tightly controlled.
PM takeaway
Temperature controls expressiveness, not factual accuracy. A confident wrong answer at temperature 0 is still wrong.
6. Top-p (Nucleus Sampling)
Top-p, also called nucleus sampling, limits choices to the smallest set of tokens whose combined probability mass reaches p. Instead of considering every possible next token, the model samples only from the “nucleus” of plausible options.
| Top-p value | What it does |
|---|---|
| 0.1 | Very narrow set of candidates — conservative |
| 0.5 | Moderate candidate pool |
| 0.9 | Broader pool — more variation |
| 1.0 | No nucleus filter — full distribution after other rules |
Top-p is often more adaptive than top-k because the candidate set size changes with context. When one token is overwhelmingly likely, top-p may effectively behave like greedy decoding. When probabilities are spread out, top-p allows more alternatives.
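A minimal sketch of the nucleus filter shows why the candidate set adapts to context. The two distributions are invented for illustration, and the helper name `top_p_filter` is hypothetical rather than any provider's API.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then renormalize."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

# Hypothetical distributions: one confident, one spread out
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.03, "Rome": 0.01}
uncertain = {"great": 0.28, "good": 0.24, "solid": 0.20, "fine": 0.16, "odd": 0.12}

print(top_p_filter(confident, 0.8))   # one token already covers 0.8 of the mass
print(top_p_filter(uncertain, 0.8))   # several tokens are needed to reach 0.8
```

With the same p, the confident case collapses to a single candidate while the uncertain case keeps several. That is the adaptive behavior described above.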
PM takeaway
Top-p is a quality filter on randomness. It helps prevent the model from selecting bizarre low-probability tokens while still allowing useful variation. Many production systems use moderate top-p with low temperature.
7. Top-k: A Simpler Filter
Top-k sampling restricts selection to the k highest-probability tokens, then samples from that fixed shortlist. For example, top-k = 40 means only the top 40 candidates are eligible.
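As a rough sketch, a top-k filter is a fixed-size cut of the ranked list. The distributions and the helper name `top_k_filter` are assumptions made for this example.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

# Hypothetical distributions: one confident, one spread out
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.03, "Rome": 0.01}
uncertain = {"great": 0.28, "good": 0.24, "solid": 0.20, "fine": 0.16, "odd": 0.12}

kept_confident = top_k_filter(confident, 3)
kept_uncertain = top_k_filter(uncertain, 3)

print(kept_confident)
print(kept_uncertain)
```

Note that the shortlist has three entries in both cases, even though the confident distribution arguably has only one sensible candidate. The filter does not look at how the probability mass is spread.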
| Approach | How it filters | PM note |
|---|---|---|
| Top-k | Fixed number of candidates | Simple, but less adaptive |
| Top-p | Dynamic candidate set by probability mass | Usually preferred in modern APIs |
| Temperature | Reshapes all probabilities | Often used together with top-p |
Some providers expose top-k; others emphasize top-p. As a PM, you do not need to master every decoding algorithm. You need to know which controls your stack exposes and what behavior they produce in evals.
In most product discussions, top-p and temperature are the primary levers. Top-k matters when your engineering team or model host explicitly recommends it.
8. Deterministic vs Creative Product Modes
Mature AI products often behave like they have modes — even if users never see the underlying parameters.
| Product mode | Sampling posture | Example features |
|---|---|---|
| Deterministic / precise | Low temperature, low top-p, optional seed | Classification, extraction, routing, JSON output |
| Balanced assistant | Moderate temperature and top-p | General chat, summarization, email drafting |
| Creative / exploratory | Higher temperature, broader top-p | Brainstorming, marketing copy, story ideas |
| Multi-option generation | Moderate to high variation with n > 1 | Regenerate, compare alternatives, A/B drafts |
The mistake is using one global setting for the entire product. A single “temperature = 0.7” default may be fine for open chat and harmful for structured workflows.
PM takeaway
Design sampling per feature, not per model. Your support classifier and your campaign copy generator should not share the same defaults by accident.
9. Reliability, Regression, and User Trust
Users trust AI features when outputs feel stable enough to act on. Sampling instability erodes trust quickly — especially in high-stakes workflows.
How sampling affects trust
| User observation | Likely sampling issue | Product risk |
|---|---|---|
| “It gave a different answer this time” | Non-zero temperature or no seed | Perceived unreliability |
| “It was correct in staging” | Eval ran once, not multiple times | False confidence |
| “Regenerate made it worse” | High variation is working as configured | UX mismatch |
| “The JSON broke overnight” | Sampling + parser fragility | Production incident |
| “Demo and prod behave differently” | Different defaults across environments | Release risk |
Reliability is not only about correctness. It is about predictable behavior under repeated use. A PM should define acceptable variance before launch — not after users complain.
PM takeaway
For operational features, document expected repeatability. If variation is intentional, explain it in the UX: “Regenerate for alternatives,” not silent inconsistency.
10. When Low Temperature Matters
Use low temperature when the product job rewards consistency over novelty.
| Use case | Why low temperature |
|---|---|
| Intent classification | Labels should not drift between runs |
| Data extraction | Field values must stay stable |
| Policy Q&A | Users expect the same question to get the same guidance |
| Structured JSON output | Reduces format and key variation |
| Routing and triage | Downstream workflows depend on stable labels |
| Compliance summaries | Language should not change meaning on rerun |
| Automated eval baselines | Reduces noise when comparing prompt versions |
Low temperature reduces randomness. It does not eliminate all failure modes. You still need good prompts, grounding, validation, and evals.
- Start with temperature 0 for structured workflows.
- Increase only if outputs feel too repetitive and repetition is a user problem.
- Pair low temperature with schema validation or post-processing checks.
11. When Higher Temperature Helps
Higher temperature is useful when the product goal is exploration, diversity, or fresh wording — not strict repeatability.
| Use case | Why higher temperature |
|---|---|
| Brainstorming | Users want multiple distinct ideas |
| Marketing copy variants | Variation helps compare tone and hooks |
| Creative writing aids | Predictable output feels stale |
| Workshop facilitation | Diverse prompts spark discussion |
| Regenerate button | Users expect a meaningfully different alternative |
| Idea expansion | Exploring adjacent concepts is the value |
The product contract matters. If the UI says “Generate alternatives,” variation is a feature. If the UI says “Get the answer,” variation feels like a bug.
PM takeaway
Higher temperature is not free creativity. It increases the chance of off-brand language, unsupported claims, and parser failures. Use it where divergence is desired — and guardrail elsewhere.
12. Using Temperature and Top-p Together
Temperature and top-p are often configured together. Temperature reshapes probabilities; top-p limits which tokens are eligible. Used well, they balance stability and flexibility.
| Configuration | Typical result | Good for |
|---|---|---|
| Temp 0 + top-p 1.0 | Near-greedy, most stable | Structured tasks, eval baselines |
| Temp 0.2 + top-p 0.9 | Mostly stable with slight phrasing flexibility | Support replies, summaries |
| Temp 0.7 + top-p 0.95 | Noticeable variation | General assistant chat |
| Temp 1.0 + top-p 0.95 | High diversity | Brainstorming, creative drafts |
| Temp 0 + top-p 0.5 | Same as greedy; the top token is always inside the nucleus, so top-p adds nothing at temperature 0 | Rarely needed; temp 0 alone is enough |
Do not tune both blindly. Change one variable at a time in evals. Otherwise you cannot explain why behavior improved or regressed.
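The interaction between the two controls can be sketched as a single sampling step. This assumes one common ordering (temperature first, then the nucleus filter, then the draw) and treats temperature 0 as greedy; real providers may order or implement these steps differently. The scores and function names are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature, top_p, rng):
    """Temperature-scale the scores, apply a nucleus filter, then draw one token."""
    if temperature == 0:
        return max(logits, key=logits.get)           # assumption: 0 means greedy
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:                         # smallest set reaching top_p
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical raw scores for the first token of a support reply
logits = {"Thanks": 3.0, "Hello": 2.2, "Hi": 1.8, "Greetings": 0.5, "Yo": -1.0}

def distinct_picks(temperature, top_p, runs=300, seed=0):
    """Collect the set of first tokens seen across repeated seeded draws."""
    rng = random.Random(seed)
    return {sample_next_token(logits, temperature, top_p, rng) for _ in range(runs)}

for temperature, top_p in [(0, 1.0), (0.2, 0.9), (0.7, 0.95), (1.0, 0.95)]:
    print(temperature, top_p, sorted(distinct_picks(temperature, top_p)))
```

In this toy setup, low temperature leaves a single token in the nucleus, while temperature 1.0 with top-p 0.95 allows several openings but still excludes the least likely one. Changing one argument at a time in a harness like this is the code-level version of the advice above.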
Practical starting pattern
For production assistants: moderate top-p (0.8–0.95) with low-to-moderate temperature (0.2–0.5), tuned per feature.
For operational extraction/classification: temperature 0 with validation.
13. Sampling Across Models and API Versions
Identical sampling settings do not guarantee identical outputs across model families, model sizes, or provider updates. The probability landscape changes when the model changes.
| Change | What may happen |
|---|---|
| New model version | Same prompt, different default behavior |
| Switch from GPT to Claude to Gemini | Optimal temperature differs by stack |
| Smaller model replaces larger model | More brittle outputs at same settings |
| Provider changes decoding defaults | Silent behavior shift |
| Fine-tuned model deployed | Distribution narrows or shifts |
This is why sampling settings should be versioned alongside prompts and eval suites. When you upgrade a model, re-run evals at the intended temperature and top-p — not just once, but multiple times per test case if variation matters.
PM takeaway
Treat sampling config as part of release management. “We kept temperature at 0.2” is not enough if the model underneath changed.
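One lightweight way to make this visible in release notes and eval reports is a fingerprint over everything that jointly determines generation behavior. The sketch below is an assumption about how a team might do it, not an established convention; the model and prompt version strings are made up.

```python
import hashlib
import json

def release_fingerprint(model_version, prompt_version, sampling):
    """Hash the model, prompt, and sampling settings into one short identifier."""
    payload = json.dumps(
        {"model": model_version, "prompt": prompt_version, "sampling": sampling},
        sort_keys=True,                      # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

sampling = {"temperature": 0.2, "top_p": 0.9, "max_tokens": 300}

# Hypothetical version labels: same sampling config, different underlying model
before = release_fingerprint("model-2025-01", "support-reply-v3", sampling)
after = release_fingerprint("model-2025-06", "support-reply-v3", sampling)
reordered = release_fingerprint(
    "model-2025-01", "support-reply-v3",
    {"max_tokens": 300, "top_p": 0.9, "temperature": 0.2},
)

print(before, after, before == reordered)
```

If the fingerprint attached to an eval report differs from the one in production, the eval results do not describe what users are seeing, even when temperature itself never moved.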
14. Sampling vs Prompting and Fine-Tuning
Sampling controls how the model chooses among plausible continuations. Prompting and fine-tuning shape what the model is likely to say in the first place. These layers solve different problems.
| Problem | Sampling change | Prompt / fine-tune change |
|---|---|---|
| Output varies too much run-to-run | Lower temperature / top-p | Prompt may help, but may not fix randomness |
| Output is repetitive and dull | Raise temperature modestly | Prompt examples can also increase diversity |
| Wrong task behavior | Sampling will not fix | Better instructions or fine-tuning |
| Wrong facts | Sampling will not fix | RAG, tools, grounding |
| Inconsistent JSON keys | Lower temperature helps | Prompt schema + validation helps more |
| Off-brand tone | Slight effect | Prompt or fine-tuning is primary fix |
A common PM mistake is trying to solve a behavior problem by tweaking temperature. If the model consistently misunderstands the task, fix the prompt, context, or training — not randomness alone.
PM takeaway
Prompting shapes intent. Fine-tuning shapes learned behavior. Sampling shapes variation. Use the right layer for the failure you actually see.
15. Common Product Scenarios
Here is how sampling choices often map to real product features.
| Feature | Recommended posture | Why |
|---|---|---|
| Customer support classifier | Temperature 0 | Stable routing labels |
| Claims field extraction | Temperature 0 + validation | Parser and workflow depend on structure |
| FAQ assistant | Low temperature | Users expect consistent guidance |
| Email rewrite assistant | Low to moderate | Some phrasing flexibility, not chaos |
| Subject line generator | Moderate to high | Variation is the product value |
| PRD brainstorming copilot | Moderate to high | Diverse ideas help discovery |
| Code explanation tool | Low | Accuracy and clarity over novelty |
| Regenerate button in chat | Moderate+ | Users expect a different alternative |
Notice that the same product can contain multiple features with different sampling policies. That is normal and desirable.
16. Structured Output and Sampling
Structured output — JSON, tables, fixed field schemas — is especially sensitive to sampling variation. A single different key, trailing comma, or reordered field can break downstream parsers.
Best practice for structured workflows:
- Use low or zero temperature.
- Specify schema clearly in the prompt or use provider schema modes when available.
- Validate output before passing it to business logic.
- Retry with repair prompt only when validation fails — not on every request.
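The validate-then-repair pattern in the list above can be sketched as follows. Everything here is illustrative: `generate` stands in for whatever function calls your model, `fake_generate` is a scripted stub so the example runs offline, and the required keys are made up.

```python
import json

def validate(raw_text, required_keys):
    """Return (data, errors); data is None when the output fails validation."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]
    missing = sorted(required_keys - set(data))
    unexpected = sorted(set(data) - required_keys)
    errors = []
    if missing:
        errors.append(f"missing keys: {missing}")
    if unexpected:
        errors.append(f"unexpected keys: {unexpected}")
    return (None, errors) if errors else (data, [])

def generate_with_repair(generate, prompt, required_keys, max_repairs=1):
    """Call the model once; send a repair prompt only when validation fails."""
    raw = generate(prompt)
    for repairs_used in range(max_repairs + 1):
        data, errors = validate(raw, required_keys)
        if data is not None:
            return data, repairs_used
        if repairs_used < max_repairs:
            raw = generate(
                f"{prompt}\n\nPrevious output was rejected ({'; '.join(errors)}). "
                "Reply with the JSON object only."
            )
    raise ValueError(f"validation failed after {max_repairs} repair(s): {errors}")

# Scripted stand-in for a model: first reply is invalid, second is valid.
replies = iter(['Sure! {"label": "billing"}', '{"label": "billing", "priority": "high"}'])

def fake_generate(prompt):
    return next(replies)

result, repairs_used = generate_with_repair(
    fake_generate, "Classify this ticket.", {"label", "priority"}
)
print(result, repairs_used)
```

The happy path costs one model call. The repair call is only paid for when validation rejects the first output, and a hard failure is raised rather than passing unvalidated text to business logic.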
Example: claim summary JSON
Prompt asks for structured output. At low temperature, the model reliably returns parseable JSON:
    {
      "claim_id": "CLM-2026-10482",
      "summary": "Hospitalization claim for appendectomy. Discharge summary present.",
      "missing_documents": ["itemized_bill"],
      "recommended_action": "request_missing_docs",
      "confidence": "medium"
    }
At higher temperature, the same prompt may produce acceptable content in a broken envelope:
- Different key names (`missingDocs` vs `missing_documents`)
- Extra commentary outside the JSON block
- Valid JSON with a different schema shape
- Markdown fences that parsers do not expect
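Some of these envelope problems can be tolerated in code, others cannot. The sketch below, which is one possible approach rather than a recommended standard, pulls the first JSON object out of a noisy reply and then reports key drift against the schema from the claim example. The helper name and the noisy reply are invented for illustration.

```python
import json

FENCE = "`" * 3  # built this way so the example does not embed a literal code fence

EXPECTED_KEYS = {"claim_id", "summary", "missing_documents", "recommended_action", "confidence"}

def extract_json_object(text):
    """Return the first parseable JSON object in text, skipping fences and commentary."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text[start:])
            return value
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

# Hypothetical higher-temperature reply: commentary, a markdown fence, and a renamed key
noisy_output = (
    "Here is the claim summary you asked for:\n"
    f"{FENCE}json\n"
    '{"claim_id": "CLM-2026-10482", "missingDocs": ["itemized_bill"]}\n'
    f"{FENCE}\n"
    "Let me know if you need anything else."
)

data = extract_json_object(noisy_output)
unexpected_keys = set(data) - EXPECTED_KEYS
missing_keys = EXPECTED_KEYS - set(data)

print(data)
print("unexpected:", sorted(unexpected_keys))
print("missing:", sorted(missing_keys))
```

Extraction recovers the object from the fence and commentary, but the renamed key and missing fields still fail the schema check. Tolerant parsing handles the envelope; it does not fix the content.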
PM takeaway
For anything consumed by code, treat sampling as part of the contract. Low temperature reduces variation; schema validation catches what sampling cannot.
17. Max Tokens, Stop Sequences, and Seeds
Sampling is not the only generation control PMs should know. These adjacent parameters also shape product behavior.
| Parameter | What it does | PM relevance |
|---|---|---|
| Max tokens | Limits response length | Prevents runaway cost and overly long outputs |
| Stop sequences | Ends generation when specific text appears | Useful for section boundaries and tool-call formats |
| Seed | Fixes randomness source when supported | Improves repeatability for tests and audits |
| Presence / frequency penalties | Reduces repetition | Can help long-form drafting features |
| n / best_of | Generate multiple candidates | Powers compare-and-choose UX |
Max tokens is a product decision, not just an engineering safeguard. If users need concise support replies, a tight max tokens limit enforces brevity even when the model would otherwise ramble.
Seeds are valuable for QA and regression testing, but support varies by provider and may not promise identical outputs across infrastructure changes. Use seeds to reduce variance in evals — not as a permanent substitute for validation.
PM takeaway
Think of generation config as a bundle: temperature, top-p, max tokens, stop sequences, and optional seed. Document the bundle per feature.
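One way to document the bundle per feature is a small typed config with a registry. This is a sketch under assumptions: the feature names and values are hypothetical, and the 0 to 2 temperature range is an assumed bound that differs between providers.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class GenerationConfig:
    """The full generation bundle for one feature."""
    temperature: float
    top_p: float
    max_tokens: int
    stop_sequences: Tuple[str, ...] = ()
    seed: Optional[int] = None

    def __post_init__(self):
        if not 0 <= self.temperature <= 2:       # assumed range; check your provider
            raise ValueError("temperature outside the assumed 0-2 range")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")

# Hypothetical per-feature registry
FEATURE_CONFIGS = {
    "ticket_classifier": GenerationConfig(temperature=0.0, top_p=1.0, max_tokens=20, seed=1234),
    "support_reply": GenerationConfig(temperature=0.3, top_p=0.9, max_tokens=300),
    "subject_line_ideas": GenerationConfig(
        temperature=0.9, top_p=0.95, max_tokens=120, stop_sequences=("\n\n",)
    ),
}

def config_for(feature):
    """Fail loudly for unknown features instead of falling back to a global default."""
    if feature not in FEATURE_CONFIGS:
        raise KeyError(f"no generation config documented for feature {feature!r}")
    return FEATURE_CONFIGS[feature]

print(asdict(config_for("support_reply")))
```

Refusing to fall back to a default is a deliberate choice here: a new feature cannot silently inherit settings that were tuned for a different job.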
18. Recommended Starting Points
Defaults should be intentional. Use this table as a starting point, then tune with evals.
| Use case | Temperature | Top-p | Notes |
|---|---|---|---|
| Classification / routing | 0 | 1.0 or provider default | Validate label set strictly |
| Extraction to JSON | 0 | 1.0 | Schema validation required |
| Policy / compliance Q&A | 0–0.2 | 0.8–0.95 | Prioritize consistency |
| Support reply drafting | 0.2–0.4 | 0.85–0.95 | Human review still recommended |
| General assistant chat | 0.5–0.7 | 0.9–0.95 | Balance helpfulness and stability |
| Summarization | 0.2–0.5 | 0.85–0.95 | Lower if summaries feed workflows |
| Marketing copy variants | 0.7–1.0 | 0.9–1.0 | Expect divergence by design |
| Brainstorming / ideation | 0.8–1.1 | 0.95–1.0 | Guardrail claims and facts separately |
These are starting points, not universal laws. Your model, prompt, and eval data should drive the final settings.
19. Designing Multi-Output UX
Sampling variation becomes a feature when the UX is designed for it. Poor multi-output UX feels broken; good multi-output UX feels intentional.
Common patterns
| Pattern | What it does | Sampling implication |
|---|---|---|
| Regenerate | User asks for another answer | Needs enough variation to feel fresh |
| n variants upfront | Show 2–3 options immediately | Use n > 1 or multiple sampled calls |
| Compare view | Side-by-side diff of alternatives | Helpful for copy and strategy tools |
| Pin / accept answer | User locks one option | Turns variation into choice, not confusion |
| History of attempts | Prior outputs remain visible | Reduces distrust when answers differ |
UX principles for multi-output features:
- Label variation as intentional (“Try another version”).
- Show what changed — not just a new blob of text.
- Let users accept, edit, or reject an output.
- Do not regenerate automatically in high-stakes flows.
- Log which sampling config produced each variant.
PM takeaway
Regenerate is a product promise. If temperature is 0, regenerate may disappoint. If temperature is too high, regenerate may destabilize trust. Match the control to the promise.
20. Sampling and Evaluation
Evaluating an LLM feature once per test case is often insufficient when sampling is enabled. You need to measure both quality and consistency.
| Metric / concept | What it tells you |
|---|---|
| Pass@1 | Success rate on a single run |
| Pass@k | Probability that at least one of k runs succeeds |
| Consistency rate | How often key fields or labels match across runs |
| Format validity rate | How often output parses against schema |
| Semantic variance | Whether meaning changes, not just wording |
| Regression suite | Repeat critical prompts after model or config changes |
Example: a classification feature may score 92% pass@1 but only 78% consistency across five runs at temperature 0.4. That gap is a product signal — not a footnote.
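A short sketch of how these numbers can be computed from repeated runs. Two assumptions are worth flagging: pass@k uses the combinatorial estimator common in code-generation benchmarks (given n recorded runs with c passes), and consistency is defined here as agreement with the most common output, which is one reasonable definition among several. The run data is invented.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k runs passes) from n recorded runs with c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def consistency_rate(labels):
    """Share of runs that agree with the most common output for one test case."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical results: five runs of one classification prompt at temperature 0.4
runs = ["billing", "billing", "refund", "billing", "billing"]
expected = "billing"

n = len(runs)
c = sum(label == expected for label in runs)

print("pass@1:", pass_at_k(n, c, 1))
print("pass@2:", pass_at_k(n, c, 2))
print("consistency:", consistency_rate(runs))
```

Pass@k rewards a feature for getting it right at least once, which suits regenerate-style UX. Consistency penalizes any disagreement, which suits routing and extraction. Reporting both keeps the two product contracts separate.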
Eval practices for PMs
- Run high-risk prompts multiple times before launch.
- Track format failures separately from factual errors.
- Compare configs side by side: temp 0 vs temp 0.3 vs temp 0.7.
- Include “regenerate” scenarios in eval sets for creative features.
- Version sampling config in eval reports.
PM takeaway
A feature can be “good on average” and still fail in production because variation is too high. Measure consistency explicitly.
21. Should Users Control Temperature?
Most consumer users do not think in terms of temperature or top-p. Exposing raw sliders creates confusion, support burden, and inconsistent outcomes.
A better pattern is user-facing presets that map to backend sampling configs.
| User-facing preset | Backend posture | Best for |
|---|---|---|
| Precise | Temperature 0–0.2, tighter top-p | Factual answers, structured tasks |
| Balanced | Temperature 0.4–0.6, moderate top-p | Everyday assistant use |
| Creative | Temperature 0.8+, broader top-p | Brainstorming, drafting alternatives |
Advanced controls can exist in developer settings or admin panels — not in the default consumer path.
PM takeaway
Translate knobs into outcomes. Users choose “More creative,” not “temperature 0.9.” Your team owns the mapping and validates it with evals.
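The mapping from presets to backend settings can be as simple as a lookup table owned by the product team. The specific numbers below are illustrative picks from within the ranges in the table above, not recommendations, and the function name is hypothetical.

```python
# Hypothetical preset table: values chosen from the ranges in the preset table
PRESETS = {
    "precise": {"temperature": 0.1, "top_p": 0.8},
    "balanced": {"temperature": 0.5, "top_p": 0.9},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_for_preset(user_choice, default="balanced"):
    """Map a UI label to backend sampling values; unknown labels use the default preset."""
    key = user_choice.strip().lower()
    return dict(PRESETS.get(key, PRESETS[default]))   # copy so callers cannot mutate PRESETS

print(sampling_for_preset("Creative"))
print(sampling_for_preset("something else"))
```

Keeping the table in one place means evals can be run per preset, and a change to what "Creative" means becomes a reviewed config change rather than a UI tweak.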
22. PM Decision Framework
Use this framework when choosing sampling settings for a feature.
Step 1: Define the product job
| Question | If yes | If no |
|---|---|---|
| Must output be repeatable? | Favor temperature 0 | Variation may be acceptable |
| Does code parse the output? | Low temperature + validation | More flexibility allowed |
| Do users want alternatives? | Design multi-output UX | Optimize for single best answer |
| Is wording novelty valuable? | Moderate to high temperature | Keep temperature low |
| Is factual stability critical? | Low temperature + grounding | Creativity can be prioritized |
Step 2: Choose the minimum sufficient randomness
| Failure observed | First lever |
|---|---|
| Same answer feels stale | Slightly raise temperature or add regenerate |
| Answers change too much | Lower temperature and/or top-p |
| JSON breaks intermittently | Temperature 0 + schema validation |
| Regenerate feels identical | Increase variation modestly |
| Model ignores instructions | Fix prompt — not temperature first |
| Facts are wrong | RAG/tools — not temperature first |
Step 3: Validate with repeated runs
Before shipping, run representative prompts multiple times. If unacceptable divergence appears, tighten sampling or add validation — do not rely on “usually works.”
23. What PMs Should Not Do
Avoid these common sampling mistakes.
| Mistake | Why it is bad |
|---|---|
| Using one global temperature for the whole product | Different features need different stability |
| Raising temperature to fix factual errors | Randomness does not add truth |
| Testing a prompt only once before launch | Hides sampling variance |
| Exposing raw temperature sliders to all users | Creates confusion and support load |
| Assuming temperature 0 means “production safe” | Correctness and grounding still matter |
| Ignoring config drift across environments | Staging and prod behave differently |
| Shipping structured output without validation | Intermittent parser failures become incidents |
| Changing temperature and top-p together during debugging | You cannot tell what helped |
| Not logging generation settings | Impossible to debug user reports |
Sampling is a powerful control. Used carelessly, it creates the illusion of progress while leaving the real product problem untouched.
24. Final Mental Model
The model predicts probabilities.
Sampling turns probabilities into picks.
Temperature controls how adventurous the picks are.
Top-p controls how many options are allowed.
Prompting shapes what is likely.
Validation ensures the product can trust the result.
If you remember only one thing from this chapter, remember this:
Match sampling to the job: low randomness for reliability, higher randomness for exploration — and never confuse variation with understanding.
That is the practical PM answer.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Sampling | How the model chooses the next token from probabilities |
| Why outputs vary | Randomness, settings, model version, and context differences |
| Greedy decoding | Always pick the top token — most deterministic common mode |
| Temperature | Creativity knob — lower is stable, higher is varied |
| Top-p | Limits choices to a nucleus of plausible tokens |
| Top-k | Fixed shortlist filter — simpler, less adaptive than top-p |
| Deterministic vs creative modes | Product features need different defaults |
| Structured output | Low temperature plus schema validation |
| Seeds and max tokens | Part of the full generation config bundle |
| Multi-output UX | Regenerate and variants should be intentional, not accidental |
| Evals | Measure pass@k and consistency — not just one-run accuracy |
| User controls | Presets like Precise / Balanced / Creative beat raw sliders |
| PM role | Choose the minimum sufficient randomness for each feature |
Closing Thought
Users ask why the AI “changed its mind.” Often the answer is not that the model learned something new overnight. It is that the product allowed a different path through the same probability landscape.
That is not inherently bad. Creativity, brainstorming, and regenerate flows depend on exactly this behavior. But operational AI — routing, extraction, compliance, structured workflows — depends on the opposite.
Product managers do not need to implement sampling algorithms. They do need to decide which features deserve stability, which deserve exploration, and how the UX makes that contract clear.
Get that decision right, and temperature stops being a mysterious engineering knob. It becomes a product design choice — one you can explain, evaluate, and defend.
The next chapter in this module explains why LLMs hallucinate: why confident wrong answers happen, and how product teams reduce hallucination risk.
The real PM lesson
Control randomness on purpose. Accidental variation is a bug; intentional variation is a feature.