Temperature, Top-p, and Sampling — The PM Version

Introduction

In Chapter 2, we saw how transformers became Large Language Models.

In Chapter 3, we covered tokens and context windows — the working memory of an AI system.

In Chapter 4, we introduced AI safety, RLHF, and Constitutional AI — why alignment matters and how human feedback shapes behavior.

In Chapter 5, we went deeper into InstructGPT and the RLHF pipeline that turned base models into instruction-following assistants.

In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the layers that shape what the model knows and how it behaves.

Now we turn to a question that confuses users, frustrates QA teams, and surprises many product managers:

Why does the same prompt sometimes produce different answers?

The answer is not randomness for its own sake. It is sampling — the process by which a language model turns probability scores into the next token, one token at a time. Temperature, top-p, and related controls determine how adventurous or conservative that process is.

This matters for product managers because sampling settings directly affect reliability, regression risk, user trust, evaluation design, structured output stability, and whether your feature feels “consistent” or “creative.” A support bot and a brainstorming assistant should not use the same generation settings — even if they share the same base model.

This chapter explains temperature, top-p, and sampling from a PM lens — not as a math lecture, but as a practical product control framework.

The simple PM version

Low temperature and tight top-p for reliability.
Higher temperature for variation and creativity.
Match sampling to the product job — not to what feels interesting in a demo.

1. Why the Same Prompt Gives Different Answers

A user submits the same question twice. The model gives two different answers. A PM runs the same eval prompt ten times. Three runs pass; seven fail. A demo looks great on Monday and flaky on Tuesday.

This is normal behavior for most LLM products — unless you deliberately configure the system for repeatability.

Common reasons outputs vary

Cause	What it means for PMs
Sampling is enabled	The model randomly picks among likely tokens
Temperature is above zero	Less likely tokens get a real chance
Top-p allows multiple paths	Output can branch early and diverge
No seed or seed not supported	Runs are not guaranteed to repeat
Model or API version changed	Probabilities shifted under the hood
Context changed slightly	Even small input differences matter
Truncation settings differ	Max tokens or stop sequences cut output at different points

Users often interpret variation as “the AI is confused.” Engineers often interpret it as “expected sampling behavior.” Product managers need a third view: Is this variation acceptable for this feature?

PM takeaway

Different answers are not always a bug. Sometimes they are a product choice. Your job is to decide whether a feature should be repeatable, exploratory, or both — and configure sampling accordingly.

2. From Probabilities to Picks: How Generation Works

At each step, the model does not “know” the full answer upfront. It predicts a probability distribution over possible next tokens, then picks one. That token is appended to the output. The process repeats until the model stops.

A simplified view of one generation step:

Step	What happens
1	Model reads prompt + generated text so far
2	Model scores every candidate next token
3	Scores become probabilities
4	Sampling rules filter or reshape those probabilities
5	One token is selected
6	Loop continues until stop condition

This is why small early differences compound. If the model chooses “However” instead of “Therefore” on token three, the rest of the answer may follow a different path.
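
The six steps above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_LOGITS` is a hand-written lookup table rather than a real LLM, the vocabulary and scores are invented, and step 4 (filtering or reshaping) is deliberately left out to keep the loop visible.

```python
import math
import random

# Hypothetical toy "model": maps the last token to scores for candidate
# next tokens. A real LLM scores tens of thousands of tokens per step.
TOY_LOGITS = {
    "<start>": {"The": 2.0, "A": 1.0},
    "The": {"refund": 2.5, "order": 2.0},
    "A": {"refund": 1.5, "order": 1.5},
    "refund": {"is": 3.0, "<end>": 0.5},
    "order": {"is": 3.0, "<end>": 0.5},
    "is": {"approved": 1.2, "pending": 1.0, "<end>": 0.2},
    "approved": {"<end>": 5.0},
    "pending": {"<end>": 5.0},
}

def softmax(logits):
    """Step 3: turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def generate(rng, max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):               # step 6: loop
        logits = TOY_LOGITS[tokens[-1]]       # steps 1-2: read context, score candidates
        probs = softmax(logits)               # step 3: scores become probabilities
        choices = list(probs)
        weights = [probs[tok] for tok in choices]
        next_token = rng.choices(choices, weights=weights, k=1)[0]  # step 5: pick one
        if next_token == "<end>":             # stop condition
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

# Different seeds can take different paths through the same probabilities.
for seed in (1, 2, 3):
    print(generate(random.Random(seed)))
```

Running it with several seeds shows the compounding effect in miniature: once the first pick differs, every later step is scored against a different context.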

PM analogy

Think of autocomplete on your phone, but at every word the keyboard ranks many options and sometimes picks a less obvious one on purpose. That “on purpose” part is sampling policy — and it is configurable.

3. What Is Sampling?

Sampling is the rule for choosing the next token from a probability distribution. With randomness turned off (greedy decoding), the model always picks the single highest-probability token: predictable, but often repetitive and dull. With stochastic sampling, the model can explore plausible alternatives.

Term	Plain-language meaning
Probability distribution	Ranked list of likely next tokens
Sampling	How one token is chosen from that list
Deterministic decoding	Always pick the top choice
Stochastic decoding	Allow randomness among likely choices
Seed	Fixed starting point for randomness, when supported

Sampling is not the same as intelligence. It does not make the model smarter. It changes how the model expresses uncertainty — more conservatively or more creatively.

For PMs, the key question is:

Should this feature explore alternatives, or should it converge on one reliable answer?

4. Greedy Decoding and Deterministic Outputs

Greedy decoding means always selecting the highest-probability token at each step. It is the most deterministic common strategy — though “deterministic” still depends on model version, infrastructure, and whether other randomness exists.

Trait	Effect
Most predictable path	Same prompt often yields same output
Low creativity	Repetitive phrasing is more likely
Good for structured tasks	Classification, extraction, JSON-like outputs
Not a guarantee of correctness	Top choice can still be wrong

In practice, many API products achieve near-greedy behavior with temperature = 0 (or the provider’s equivalent “deterministic” setting). Some providers also expose an explicit seed parameter for stronger repeatability.
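
As a minimal sketch, greedy decoding at a single step is just "take the maximum." The token names and probabilities below are invented for illustration.

```python
def greedy_pick(probs):
    """Return the highest-probability token; ties go to the first one listed."""
    return max(probs, key=probs.get)

# Hypothetical next-token probabilities for a routing decision.
probs = {"approve": 0.55, "escalate": 0.30, "deny": 0.15}

# No randomness is involved, so repeated picks are identical.
picks = [greedy_pick(probs) for _ in range(5)]
print(picks)
```

Note what the sketch does not show: if the model's top choice is wrong, greedy decoding returns the same wrong answer every time.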

PM takeaway

If your feature breaks when wording changes slightly — policy Q&A, form filling, routing labels, compliance summaries — start with low or zero temperature. Do not assume users want creative variation in operational workflows.

5. Temperature: The Creativity Knob

Temperature rescales the model’s token scores before they are turned into probabilities and sampled. Low temperature sharpens the distribution, so top tokens dominate. High temperature flattens it, so lower-ranked tokens become more likely.

Temperature	Typical behavior	Product feel
0	Most deterministic	Reliable, repetitive
0.1–0.3	Mostly conservative	Stable assistant
0.5–0.7	Balanced variation	General-purpose chat
0.8–1.0	More diverse wording	Brainstorming, drafting
> 1.0	High randomness	Experimental, risky for production

Temperature does not change what the model knows. It changes how willing the system is to choose less obvious continuations.
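
The sharpening and flattening can be seen in a few lines. This sketch uses the usual formulation (divide each raw score by the temperature, then apply softmax). The three scores are invented, and real providers may handle temperature 0 as a special case.

```python
import math

def apply_temperature(logits, temperature):
    """Divide scores by temperature, then softmax. Assumes temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these scores, the top token holds about 99% of the probability at temperature 0.2 and about 67% at temperature 1.0. The ranking never changes; only how much room the lower-ranked tokens get.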

Example

Prompt: “Suggest three subject lines for a renewal email.” At temperature 0.2, subject lines may be safe and similar across runs. At temperature 0.9, subject lines may vary widely — useful for ideation, risky if brand tone must stay tightly controlled.

PM takeaway

Temperature controls expressiveness, not factual accuracy. A confident wrong answer at temperature 0 is still wrong.

6. Top-p (Nucleus Sampling)

Top-p, also called nucleus sampling, limits choices to the smallest set of tokens whose combined probability mass reaches p. Instead of considering every possible next token, the model samples only from the “nucleus” of plausible options.

Top-p value	What it does
0.1	Very narrow set of candidates — conservative
0.5	Moderate candidate pool
0.9	Broader pool — more variation
1.0	No nucleus filter; the full distribution stays eligible, subject to any other rules

Top-p is often more adaptive than top-k because the candidate set size changes with context. When one token is overwhelmingly likely, top-p may effectively behave like greedy decoding. When probabilities are spread out, top-p allows more alternatives.
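
A minimal sketch of the nucleus filter makes the adaptive behavior concrete. The two distributions are invented: one where a single token dominates and one where probability is spread out.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}  # renormalize survivors

# Hypothetical distributions for two different contexts.
peaked = {"yes": 0.92, "no": 0.05, "maybe": 0.02, "perhaps": 0.01}
spread = {"great": 0.30, "good": 0.28, "solid": 0.22, "fine": 0.20}

print(top_p_filter(peaked, 0.9))  # only "yes" survives, so this step is effectively greedy
print(top_p_filter(spread, 0.9))  # all four tokens are needed to reach 0.9
```

The same top-p value yields a nucleus of one token in the first context and four in the second, which is the context-dependent sizing described above.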

PM takeaway

Top-p is a quality filter on randomness. It helps prevent the model from selecting bizarre low-probability tokens while still allowing useful variation. Many production systems use moderate top-p with low temperature.

7. Top-k: A Simpler Filter

Top-k sampling restricts selection to the k highest-probability tokens, then samples from that fixed shortlist. For example, top-k = 40 means only the top 40 candidates are eligible.
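
For comparison, a top-k filter can be sketched in a few lines. The probabilities are invented, and k is kept tiny so the effect is visible.

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, then renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

# Hypothetical next-token probabilities.
probs = {"the": 0.40, "a": 0.30, "this": 0.15, "that": 0.10, "one": 0.05}
print(top_k_filter(probs, 2))  # shortlist is always exactly two tokens here
```

The shortlist size is fixed at k regardless of how confident the model is at that step.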

Approach	How it filters	PM note
Top-k	Fixed number of candidates	Simple, but less adaptive
Top-p	Dynamic candidate set by probability mass	Usually preferred in modern APIs
Temperature	Reshapes all probabilities	Often used together with top-p

Some providers expose top-k; others emphasize top-p. As a PM, you do not need to master every decoding algorithm. You need to know which controls your stack exposes and what behavior they produce in evals.

In most product discussions, top-p and temperature are the primary levers. Top-k matters when your engineering team or model host explicitly recommends it.

8. Deterministic vs Creative Product Modes

Mature AI products often behave like they have modes — even if users never see the underlying parameters.

Product mode	Sampling posture	Example features
Deterministic / precise	Low temperature, low top-p, optional seed	Classification, extraction, routing, JSON output
Balanced assistant	Moderate temperature and top-p	General chat, summarization, email drafting
Creative / exploratory	Higher temperature, broader top-p	Brainstorming, marketing copy, story ideas
Multi-option generation	Moderate to high variation with n > 1	Regenerate, compare alternatives, A/B drafts

The mistake is using one global setting for the entire product. A single “temperature = 0.7” default may be fine for open chat and harmful for structured workflows.

PM takeaway

Design sampling per feature, not per model. Your support classifier and your campaign copy generator should not share the same defaults by accident.

9. Reliability, Regression, and User Trust

Users trust AI features when outputs feel stable enough to act on. Sampling instability erodes trust quickly — especially in high-stakes workflows.

How sampling affects trust

User observation	Likely sampling issue	Product risk
“It gave a different answer this time”	Non-zero temperature or no seed	Perceived unreliability
“It was correct in staging”	Eval ran once, not multiple times	False confidence
“Regenerate made it worse”	High variation is working as configured	UX mismatch
“The JSON broke overnight”	Sampling + parser fragility	Production incident
“Demo and prod behave differently”	Different defaults across environments	Release risk

Reliability is not only about correctness. It is about predictable behavior under repeated use. A PM should define acceptable variance before launch — not after users complain.

PM takeaway

For operational features, document expected repeatability. If variation is intentional, explain it in the UX: “Regenerate for alternatives,” not silent inconsistency.

10. When Low Temperature Matters

Use low temperature when the product job rewards consistency over novelty.

Use case	Why low temperature
Intent classification	Labels should not drift between runs
Data extraction	Field values must stay stable
Policy Q&A	Users expect the same question to get the same guidance
Structured JSON output	Reduces format and key variation
Routing and triage	Downstream workflows depend on stable labels
Compliance summaries	Language should not change meaning on rerun
Automated eval baselines	Reduces noise when comparing prompt versions

Low temperature reduces randomness. It does not eliminate all failure modes. You still need good prompts, grounding, validation, and evals.

Start with temperature 0 for structured workflows.
Increase only if outputs feel too repetitive and repetition is a user problem.
Pair low temperature with schema validation or post-processing checks.

11. When Higher Temperature Helps

Higher temperature is useful when the product goal is exploration, diversity, or fresh wording — not strict repeatability.

Use case	Why higher temperature
Brainstorming	Users want multiple distinct ideas
Marketing copy variants	Variation helps compare tone and hooks
Creative writing aids	Predictable output feels stale
Workshop facilitation	Diverse prompts spark discussion
Regenerate button	Users expect a meaningfully different alternative
Idea expansion	Exploring adjacent concepts is the value

The product contract matters. If the UI says “Generate alternatives,” variation is a feature. If the UI says “Get the answer,” variation feels like a bug.

PM takeaway

Higher temperature is not free creativity. It increases the chance of off-brand language, unsupported claims, and parser failures. Use it where divergence is desired — and guardrail elsewhere.

12. Using Temperature and Top-p Together

Temperature and top-p are often configured together. Temperature reshapes probabilities; top-p limits which tokens are eligible. Used well, they balance stability and flexibility.

Configuration	Typical result	Good for
Temp 0 + top-p 1.0	Near-greedy, most stable	Structured tasks, eval baselines
Temp 0.2 + top-p 0.9	Mostly stable with slight phrasing flexibility	Support replies, summaries
Temp 0.7 + top-p 0.95	Noticeable variation	General assistant chat
Temp 1.0 + top-p 0.95	High diversity	Brainstorming, creative drafts
Temp 0 + top-p 0.5	Same as greedy in practice; top-p has little effect once temperature is 0	Hard stability requirements

Do not tune both blindly. Change one variable at a time in evals. Otherwise you cannot explain why behavior improved or regressed.
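
One way to enforce the one-variable-at-a-time rule is to generate eval configs from a baseline. This is a sketch with hypothetical names (`one_at_a_time`, `baseline`, `candidates`), and the values are examples rather than recommendations.

```python
def one_at_a_time(baseline, candidates):
    """Yield configs that differ from the baseline in exactly one parameter."""
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                yield {**baseline, name: value}

baseline = {"temperature": 0.2, "top_p": 0.9}
candidates = {
    "temperature": [0.0, 0.2, 0.5],
    "top_p": [0.8, 0.9, 0.95],
}

for config in one_at_a_time(baseline, candidates):
    print(config)
```

Each eval run then differs from the baseline in a single setting, so a change in results can be attributed to that setting.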

Practical starting pattern

For production assistants: moderate top-p (0.8–0.95) with low-to-moderate temperature (0.2–0.5), tuned per feature.
For operational extraction/classification: temperature 0 with validation.

13. Sampling Across Models and API Versions

Identical sampling settings do not guarantee identical outputs across model families, model sizes, or provider updates. The probability landscape changes when the model changes.

Change	What may happen
New model version	Same prompt, different default behavior
Switch between GPT, Claude, and Gemini	Optimal temperature differs by stack
Smaller model replaces larger model	More brittle outputs at same settings
Provider changes decoding defaults	Silent behavior shift
Fine-tuned model deployed	Distribution narrows or shifts

This is why sampling settings should be versioned alongside prompts and eval suites. When you upgrade a model, re-run evals at the intended temperature and top-p — not just once, but multiple times per test case if variation matters.
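
A lightweight way to version these together is a fingerprint over the model version, prompt template, and sampling settings. This is a sketch, not a standard practice from any particular tool. The function name, model identifiers, and prompt are all hypothetical.

```python
import hashlib
import json

def release_fingerprint(model_version, prompt_template, sampling):
    """Short hash that changes whenever model, prompt, or sampling config changes."""
    payload = json.dumps(
        {"model": model_version, "prompt": prompt_template, "sampling": sampling},
        sort_keys=True,  # key order must not affect the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

before = release_fingerprint(
    "example-model-v1", "Classify the ticket: {text}", {"temperature": 0.2, "top_p": 0.9}
)
after = release_fingerprint(
    "example-model-v2", "Classify the ticket: {text}", {"temperature": 0.2, "top_p": 0.9}
)
print(before, after, before != after)
```

Storing the fingerprint with each eval report makes it obvious when results were produced under a different model, even if the sampling settings were left untouched.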

PM takeaway

Treat sampling config as part of release management. “We kept temperature at 0.2” is not enough if the model underneath changed.

14. Sampling vs Prompting and Fine-Tuning

Sampling controls how the model chooses among plausible continuations. Prompting and fine-tuning shape what the model is likely to say in the first place. These layers solve different problems.

Problem	Sampling change	Prompt / fine-tune change
Output varies too much run-to-run	Lower temperature / top-p	Prompt may help, but may not fix randomness
Output is repetitive and dull	Raise temperature modestly	Prompt examples can also increase diversity
Wrong task behavior	Sampling will not fix	Better instructions or fine-tuning
Wrong facts	Sampling will not fix	RAG, tools, grounding
Inconsistent JSON keys	Lower temperature helps	Prompt schema + validation helps more
Off-brand tone	Slight effect	Prompt or fine-tuning is primary fix

A common PM mistake is trying to solve a behavior problem by tweaking temperature. If the model consistently misunderstands the task, fix the prompt, context, or training — not randomness alone.

PM takeaway

Prompting shapes intent. Fine-tuning shapes learned behavior. Sampling shapes variation. Use the right layer for the failure you actually see.

15. Common Product Scenarios

Here is how sampling choices often map to real product features.

Feature	Recommended posture	Why
Customer support classifier	Temperature 0	Stable routing labels
Claims field extraction	Temperature 0 + validation	Parser and workflow depend on structure
FAQ assistant	Low temperature	Users expect consistent guidance
Email rewrite assistant	Low to moderate	Some phrasing flexibility, not chaos
Subject line generator	Moderate to high	Variation is the product value
PRD brainstorming copilot	Moderate to high	Diverse ideas help discovery
Code explanation tool	Low	Accuracy and clarity over novelty
Regenerate button in chat	Moderate+	Users expect a different alternative

Notice that the same product can contain multiple features with different sampling policies. That is normal and desirable.

16. Structured Output and Sampling

Structured output — JSON, tables, fixed field schemas — is especially sensitive to sampling variation. A single different key, trailing comma, or reordered field can break downstream parsers.

Best practice for structured workflows:

Use low or zero temperature.
Specify schema clearly in the prompt or use provider schema modes when available.
Validate output before passing it to business logic.
Retry with repair prompt only when validation fails — not on every request.

Example: claim summary JSON

Prompt asks for structured output. At low temperature, the model reliably returns parseable JSON:

{
  "claim_id": "CLM-2026-10482",
  "summary": "Hospitalization claim for appendectomy. Discharge summary present.",
  "missing_documents": ["itemized_bill"],
  "recommended_action": "request_missing_docs",
  "confidence": "medium"
}

At higher temperature, the same prompt may produce acceptable content in a broken envelope:

Different key names (missingDocs vs missing_documents)
Extra commentary outside the JSON block
Valid JSON with a different schema shape
Markdown fences that parsers do not expect

PM takeaway

For anything consumed by code, treat sampling as part of the contract. Low temperature reduces variation; schema validation catches what sampling cannot.
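
A validation gate for the claim summary above might look like the following sketch. The required keys come from the example JSON. The function name and the exact error messages are illustrative assumptions, and a production system would more likely use a schema library or a provider schema mode.

```python
import json

REQUIRED_KEYS = {
    "claim_id", "summary", "missing_documents", "recommended_action", "confidence",
}

def parse_claim_summary(raw_text):
    """Return (parsed_dict, None) on success or (None, reason) on failure."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing or extra:
        return None, f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}"
    if not isinstance(data["missing_documents"], list):
        return None, "missing_documents must be a list"
    return data, None

good = json.dumps({
    "claim_id": "CLM-2026-10482",
    "summary": "Hospitalization claim for appendectomy. Discharge summary present.",
    "missing_documents": ["itemized_bill"],
    "recommended_action": "request_missing_docs",
    "confidence": "medium",
})
renamed_key = good.replace("missing_documents", "missingDocs")
with_commentary = "Here is the summary: " + good

for candidate in (good, renamed_key, with_commentary):
    parsed, error = parse_claim_summary(candidate)
    print("ok" if error is None else error)
```

Only the outputs that fail this gate would go to a repair prompt. Everything else passes straight through to business logic.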

17. Max Tokens, Stop Sequences, and Seeds

Sampling is not the only generation control PMs should know. These adjacent parameters also shape product behavior.

Parameter	What it does	PM relevance
Max tokens	Limits response length	Prevents runaway cost and overly long outputs
Stop sequences	Ends generation when specific text appears	Useful for section boundaries and tool-call formats
Seed	Fixes randomness source when supported	Improves repeatability for tests and audits
Presence / frequency penalties	Reduces repetition	Can help long-form drafting features
n / best_of	Generate multiple candidates	Powers compare-and-choose UX

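Max tokens is a product decision, not just an engineering safeguard. If users need concise support replies, a tight max tokens limit caps length even when the model would otherwise ramble. Because a hard cap truncates rather than summarizes, pair it with a prompt instruction to be brief so replies do not end mid-sentence.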

Seeds are valuable for QA and regression testing, but support varies by provider and may not promise identical outputs across infrastructure changes. Use seeds to reduce variance in evals — not as a permanent substitute for validation.

PM takeaway

Think of generation config as a bundle: temperature, top-p, max tokens, stop sequences, and optional seed. Document the bundle per feature.
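
The bundle can be documented as a small typed record per feature. This is a sketch: the class name, feature names, and specific values (including the max tokens limits and the seed) are hypothetical, and real parameter names vary by provider.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)  # frozen: configs cannot be mutated by accident at runtime
class GenerationConfig:
    temperature: float
    top_p: float
    max_tokens: int
    stop_sequences: Tuple[str, ...] = ()
    seed: Optional[int] = None  # only meaningful where the provider supports seeds

# One entry per feature, so defaults are never shared by accident.
FEATURE_CONFIGS = {
    "support_classifier": GenerationConfig(
        temperature=0.0, top_p=1.0, max_tokens=16, seed=7
    ),
    "subject_line_generator": GenerationConfig(
        temperature=0.9, top_p=0.95, max_tokens=64
    ),
}

for feature, config in FEATURE_CONFIGS.items():
    print(feature, asdict(config))
```

Keeping the record in code means it can be reviewed, diffed, and logged with each request.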

18. Recommended Starting Points

Defaults should be intentional. Use this table as a starting point, then tune with evals.

Use case	Temperature	Top-p	Notes
Classification / routing	0	1.0 or provider default	Validate label set strictly
Extraction to JSON	0	1.0	Schema validation required
Policy / compliance Q&A	0–0.2	0.8–0.95	Prioritize consistency
Support reply drafting	0.2–0.4	0.85–0.95	Human review still recommended
General assistant chat	0.5–0.7	0.9–0.95	Balance helpfulness and stability
Summarization	0.2–0.5	0.85–0.95	Lower if summaries feed workflows
Marketing copy variants	0.7–1.0	0.9–1.0	Expect divergence by design
Brainstorming / ideation	0.8–1.1	0.95–1.0	Guardrail claims and facts separately

These are starting points, not universal laws. Your model, prompt, and eval data should drive the final settings.

19. Designing Multi-Output UX

Sampling variation becomes a feature when the UX is designed for it. Poor multi-output UX feels broken; good multi-output UX feels intentional.

Common patterns

Pattern	What it does	Sampling implication
Regenerate	User asks for another answer	Needs enough variation to feel fresh
n variants upfront	Show 2–3 options immediately	Use n > 1 or multiple sampled calls
Compare view	Side-by-side diff of alternatives	Helpful for copy and strategy tools
Pin / accept answer	User locks one option	Turns variation into choice, not confusion
History of attempts	Prior outputs remain visible	Reduces distrust when answers differ

UX principles for multi-output features:

Label variation as intentional (“Try another version”).
Show what changed — not just a new blob of text.
Let users accept, edit, or reject an output.
Do not regenerate automatically in high-stakes flows.
Log which sampling config produced each variant.

PM takeaway

Regenerate is a product promise. If temperature is 0, regenerate may disappoint. If temperature is too high, regenerate may destabilize trust. Match the control to the promise.

20. Sampling and Evaluation

Evaluating an LLM feature once per test case is often insufficient when sampling is enabled. You need to measure both quality and consistency.

Metric / concept	What it tells you
Pass@1	Success rate on a single run
Pass@k	Probability that at least one of k runs succeeds
Consistency rate	How often key fields or labels match across runs
Format validity rate	How often output parses against schema
Semantic variance	Whether meaning changes, not just wording
Regression suite	Repeat critical prompts after model or config changes

Example: a classification feature may score 92% pass@1 but only 78% consistency across five runs at temperature 0.4. That gap is a product signal — not a footnote.
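
The gap between pass rate and consistency can be computed from repeated runs. The sketch below uses invented data and one possible definition of consistency (every run of a test case produced the same label). Pass@k here is the plain empirical share of cases where at least one run succeeded.

```python
def pass_at_1(runs, expected):
    """Share of test cases whose first run matched the expected label."""
    hits = sum(1 for labels, want in zip(runs, expected) if labels[0] == want)
    return hits / len(expected)

def pass_at_k(runs, expected):
    """Share of test cases where at least one of the k runs matched."""
    hits = sum(1 for labels, want in zip(runs, expected) if want in labels)
    return hits / len(expected)

def consistency_rate(runs):
    """Share of test cases where every run produced the same label."""
    stable = sum(1 for labels in runs if len(set(labels)) == 1)
    return stable / len(runs)

# Hypothetical results: four test cases, three runs each.
runs = [
    ["billing", "billing", "billing"],
    ["billing", "refund", "billing"],
    ["refund", "refund", "refund"],
    ["login", "login", "billing"],
]
expected = ["billing", "billing", "refund", "login"]

print("pass@1:", pass_at_1(runs, expected))
print("pass@3:", pass_at_k(runs, expected))
print("consistency:", consistency_rate(runs))
```

In this invented data, pass@1 is 100% while consistency is only 50%: a single-run eval would have reported a perfect score for a feature that changes its label on half the test cases.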

Eval practices for PMs

Run high-risk prompts multiple times before launch.
Track format failures separately from factual errors.
Compare configs side by side: temp 0 vs temp 0.3 vs temp 0.7.
Include “regenerate” scenarios in eval sets for creative features.
Version sampling config in eval reports.

PM takeaway

A feature can be “good on average” and still fail in production because variation is too high. Measure consistency explicitly.

21. Should Users Control Temperature?

Most consumer users do not think in terms of temperature or top-p. Exposing raw sliders creates confusion, support burden, and inconsistent outcomes.

A better pattern is user-facing presets that map to backend sampling configs.

User-facing preset	Backend posture	Best for
Precise	Temperature 0–0.2, tighter top-p	Factual answers, structured tasks
Balanced	Temperature 0.4–0.6, moderate top-p	Everyday assistant use
Creative	Temperature 0.8+, broader top-p	Brainstorming, drafting alternatives

Advanced controls can exist in developer settings or admin panels — not in the default consumer path.

PM takeaway

Translate knobs into outcomes. Users choose “More creative,” not “temperature 0.9.” Your team owns the mapping and validates it with evals.
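
The mapping itself can be a small lookup owned by the product team. In this sketch the temperatures fall inside the ranges from the table above, while the top-p numbers are assumptions chosen only to illustrate "tighter," "moderate," and "broader."

```python
# Hypothetical preset table; tune the values with evals before shipping.
PRESETS = {
    "precise": {"temperature": 0.1, "top_p": 0.8},
    "balanced": {"temperature": 0.5, "top_p": 0.9},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}

def resolve_preset(name):
    """Translate a user-facing preset label into backend sampling settings."""
    key = name.strip().lower()
    if key not in PRESETS:
        raise ValueError(f"unknown preset: {name!r}")
    return dict(PRESETS[key])  # copy, so callers cannot mutate the shared table

print(resolve_preset("Creative"))
```

Users only ever see the labels, and the team can retune the numbers behind them without changing the interface.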

22. PM Decision Framework

Use this framework when choosing sampling settings for a feature.

Step 1: Define the product job

Question	If yes	If no
Must output be repeatable?	Favor temperature 0	Variation may be acceptable
Does code parse the output?	Low temperature + validation	More flexibility allowed
Do users want alternatives?	Design multi-output UX	Optimize for single best answer
Is wording novelty valuable?	Moderate to high temperature	Keep temperature low
Is factual stability critical?	Low temperature + grounding	Creativity can be prioritized

Step 2: Choose the minimum sufficient randomness

Failure observed	First lever
Same answer feels stale	Slightly raise temperature or add regenerate
Answers change too much	Lower temperature and/or top-p
JSON breaks intermittently	Temperature 0 + schema validation
Regenerate feels identical	Increase variation modestly
Model ignores instructions	Fix prompt — not temperature first
Facts are wrong	RAG/tools — not temperature first

Step 3: Validate with repeated runs

Before shipping, run representative prompts multiple times. If unacceptable divergence appears, tighten sampling or add validation — do not rely on “usually works.”

23. What PMs Should Not Do

Avoid these common sampling mistakes.

Mistake	Why it is bad
Using one global temperature for the whole product	Different features need different stability
Raising temperature to fix factual errors	Randomness does not add truth
Testing a prompt only once before launch	Hides sampling variance
Exposing raw temperature sliders to all users	Creates confusion and support load
Assuming temperature 0 means “production safe”	Correctness and grounding still matter
Ignoring config drift across environments	Staging and prod behave differently
Shipping structured output without validation	Intermittent parser failures become incidents
Changing temperature and top-p together during debugging	You cannot tell what helped
Not logging generation settings	Impossible to debug user reports

Sampling is a powerful control. Used carelessly, it creates the illusion of progress while leaving the real product problem untouched.

24. Final Mental Model

The model predicts probabilities.
Sampling turns probabilities into picks.
Temperature controls how adventurous the picks are.
Top-p controls how many options are allowed.
Prompting shapes what is likely.
Validation ensures the product can trust the result.

If you remember only one thing from this chapter, remember this:

Match sampling to the job: low randomness for reliability, higher randomness for exploration — and never confuse variation with understanding.

That is the practical PM answer.

Chapter Summary

Concept	PM understanding
Sampling	How the model chooses the next token from probabilities
Why outputs vary	Randomness, settings, model version, and context differences
Greedy decoding	Always pick the top token — most deterministic common mode
Temperature	Creativity knob — lower is stable, higher is varied
Top-p	Limits choices to a nucleus of plausible tokens
Top-k	Fixed shortlist filter — simpler, less adaptive than top-p
Deterministic vs creative modes	Product features need different defaults
Structured output	Low temperature plus schema validation
Seeds and max tokens	Part of the full generation config bundle
Multi-output UX	Regenerate and variants should be intentional, not accidental
Evals	Measure pass@k and consistency — not just one-run accuracy
User controls	Presets like Precise / Balanced / Creative beat raw sliders
PM role	Choose the minimum sufficient randomness for each feature

Closing Thought

Users ask why the AI “changed its mind.” Often the answer is not that the model learned something new overnight. It is that the product allowed a different path through the same probability landscape.

That is not inherently bad. Creativity, brainstorming, and regenerate flows depend on exactly this behavior. But operational AI — routing, extraction, compliance, structured workflows — depends on the opposite.

Product managers do not need to implement sampling algorithms. They do need to decide which features deserve stability, which deserve exploration, and how the UX makes that contract clear.

Get that decision right, and temperature stops being a mysterious engineering knob. It becomes a product design choice — one you can explain, evaluate, and defend.

The next chapter in this module compares pre-training, fine-tuning, and RLHF — how each stage shapes what a model knows and how it behaves before sampling ever enters the picture.

The real PM lesson

Control randomness on purpose. Accidental variation is a bug; intentional variation is a feature.
