Chapter 07 · Module 01 · Beginner–Intermediate · 22–26 min

Chapter 7: Temperature, Top-p, and Sampling — The PM Version

Why the same prompt can produce different answers — and how temperature and top-p shape product reliability.

Book: AI Learning
[Figure labels: Tokens, Temperature, Top-p, Output]

Probabilities become picks — and picks become product behavior

Introduction

In Chapter 2, we saw how transformers became Large Language Models.

In Chapter 3, we covered tokens and context windows — the working memory of an AI system.

In Chapter 4, we introduced AI safety, RLHF, and Constitutional AI — why alignment matters and how human feedback shapes behavior.

In Chapter 5, we went deeper into InstructGPT and the RLHF pipeline that turned base models into instruction-following assistants.

In Chapter 6, we compared prompting, fine-tuning, RAG, and tools — the layers that shape what the model knows and how it behaves.

Now we turn to a question that confuses users, frustrates QA teams, and surprises many product managers:

Why does the same prompt sometimes produce different answers?

The answer is not randomness for its own sake. It is sampling — the process by which a language model turns probability scores into the next token, one token at a time. Temperature, top-p, and related controls determine how adventurous or conservative that process is.

This matters for product managers because sampling settings directly affect reliability, regression risk, user trust, evaluation design, structured output stability, and whether your feature feels “consistent” or “creative.” A support bot and a brainstorming assistant should not use the same generation settings — even if they share the same base model.

This chapter explains temperature, top-p, and sampling from a PM lens — not as a math lecture, but as a practical product control framework.

The simple PM version

Low temperature and tight top-p for reliability.
Higher temperature for variation and creativity.
Match sampling to the product job — not to what feels interesting in a demo.

1. Why the Same Prompt Gives Different Answers

A user submits the same question twice. The model gives two different answers. A PM runs the same eval prompt ten times. Three runs pass; seven fail. A demo looks great on Monday and flaky on Tuesday.

This is normal behavior for most LLM products — unless you deliberately configure the system for repeatability.

Common reasons outputs vary

Cause | What it means for PMs
Sampling is enabled | The model randomly picks among likely tokens
Temperature is above zero | Less likely tokens get a real chance
Top-p allows multiple paths | Output can branch early and diverge
No seed or seed not supported | Runs are not guaranteed to repeat
Model or API version changed | Probabilities shifted under the hood
Context changed slightly | Even small input differences matter
Streaming or truncation | Max tokens or stop sequences cut output differently

Users often interpret variation as “the AI is confused.” Engineers often interpret it as “expected sampling behavior.” Product managers need a third view: Is this variation acceptable for this feature?

PM takeaway

Different answers are not always a bug. Sometimes they are a product choice. Your job is to decide whether a feature should be repeatable, exploratory, or both — and configure sampling accordingly.

2. From Probabilities to Picks: How Generation Works

At each step, the model does not “know” the full answer upfront. It predicts a probability distribution over possible next tokens, then picks one. That token is appended to the output. The process repeats until the model stops.

A simplified view of one generation step:

Step | What happens
1 | Model reads prompt + generated text so far
2 | Model scores every candidate next token
3 | Scores become probabilities
4 | Sampling rules filter or reshape those probabilities
5 | One token is selected
6 | Loop continues until stop condition

This is why small early differences compound. If the model chooses “However” instead of “Therefore” on token three, the rest of the answer may follow a different path.
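The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how a real model is implemented: TOY_MODEL is a hypothetical lookup table standing in for the neural network, and its tokens and probabilities are invented for the example.

```python
import random

# Hypothetical stand-in for a model: previous token -> next-token probabilities.
TOY_MODEL = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"refund": 0.7, "order": 0.3},
    "A": {"refund": 0.5, "order": 0.5},
    "refund": {"<end>": 1.0},
    "order": {"<end>": 1.0},
}

def generate(rng, max_tokens=10):
    """Repeat: look up probabilities, pick one token, append it, continue."""
    output = []
    previous = "<start>"
    for _ in range(max_tokens):
        probs = TOY_MODEL[previous]
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        choice = rng.choices(tokens, weights=weights, k=1)[0]
        if choice == "<end>":  # stop condition
            break
        output.append(choice)
        previous = choice
    return " ".join(output)

# Different seeds may take different paths; the same seed repeats its path.
print(generate(random.Random(1)))
print(generate(random.Random(2)))
```

Because each pick becomes the input to the next step, one different early choice changes every probability table that follows.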

PM analogy

Think of autocomplete on your phone, but at every word the keyboard ranks many options and sometimes picks a less obvious one on purpose. That “on purpose” part is sampling policy — and it is configurable.

3. What Is Sampling?

Sampling is the rule for choosing the next token from a probability distribution. With greedy decoding, a model always picks the single highest-probability token, which is predictable but often repetitive and dull. With stochastic sampling, the model can explore plausible alternatives.

Term | Plain-language meaning
Probability distribution | Ranked list of likely next tokens
Sampling | How one token is chosen from that list
Deterministic decoding | Always pick the top choice
Stochastic decoding | Allow randomness among likely choices
Seed | Fixed starting point for randomness, when supported

Sampling is not the same as intelligence. It does not make the model smarter. It changes how the model expresses uncertainty — more conservatively or more creatively.

For PMs, the key question is:

Should this feature explore alternatives, or should it converge on one reliable answer?

4. Greedy Decoding and Deterministic Outputs

Greedy decoding means always selecting the highest-probability token at each step. It is the most deterministic common strategy — though “deterministic” still depends on model version, infrastructure, and whether other randomness exists.

Trait | Effect
Most predictable path | Same prompt often yields same output
Low creativity | Repetitive phrasing is more likely
Good for structured tasks | Classification, extraction, JSON-like outputs
Not a guarantee of correctness | Top choice can still be wrong

In practice, many API products achieve near-greedy behavior with temperature = 0 (or the provider’s equivalent “deterministic” setting). Some providers also expose an explicit seed parameter for stronger repeatability.
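To make the contrast concrete, here is a minimal sketch of greedy versus stochastic picking, plus the effect of a seed. The function names and the probability table are hypothetical; real providers implement this inside the model server and may not expose a seed at all.

```python
import random

def greedy_pick(probs):
    """Deterministic decoding: always return the highest-probability token."""
    return max(probs, key=probs.get)

def sampled_pick(probs, rng):
    """Stochastic decoding: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Invented next-token probabilities for a routing label.
next_token_probs = {"billing": 0.55, "refund": 0.35, "other": 0.10}

# Greedy gives the same label on every call.
print([greedy_pick(next_token_probs) for _ in range(3)])

# Sampling can vary between calls, but a fixed seed replays the same draws.
rng_a = random.Random(123)
rng_b = random.Random(123)
print([sampled_pick(next_token_probs, rng_a) for _ in range(5)])
print([sampled_pick(next_token_probs, rng_b) for _ in range(5)])
```

Note that greedy returns "billing" every time even though the model gives it only a 55% chance, which is the point of the table row above: the top choice is stable, not guaranteed correct.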

PM takeaway

If your feature breaks when wording changes slightly — policy Q&A, form filling, routing labels, compliance summaries — start with low or zero temperature. Do not assume users want creative variation in operational workflows.

5. Temperature: The Creativity Knob

Temperature rescales the model’s raw token scores before they become probabilities: each score is divided by the temperature. Low temperature sharpens the distribution, so top tokens dominate. High temperature flattens it, so lower-ranked tokens become more likely.

Temperature | Typical behavior | Product feel
0 | Most deterministic | Reliable, repetitive
0.1–0.3 | Mostly conservative | Stable assistant
0.5–0.7 | Balanced variation | General-purpose chat
0.8–1.0 | More diverse wording | Brainstorming, drafting
> 1.0 | High randomness | Experimental, risky for production

Temperature does not change what the model knows. It changes how willing the system is to choose less obvious continuations.
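For readers who want to see the mechanics, the sketch below shows the standard way temperature is applied: raw scores are divided by the temperature and then passed through softmax. The three scores are invented for illustration, and the function name is an assumption, not any provider’s API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide raw scores by temperature, then convert them to probabilities."""
    if temperature <= 0:
        # Temperature 0 is usually handled as a special case (greedy decoding).
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running this shows the top token’s share shrinking as temperature rises, while the ranking of the tokens never changes. That is the sense in which temperature changes willingness, not knowledge.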

Example

Prompt: “Suggest three subject lines for a renewal email.” At temperature 0.2, subject lines may be safe and similar across runs. At temperature 0.9, subject lines may vary widely — useful for ideation, risky if brand tone must stay tightly controlled.

PM takeaway

Temperature controls expressiveness, not factual accuracy. A confident wrong answer at temperature 0 is still wrong.

6. Top-p (Nucleus Sampling)

Top-p, also called nucleus sampling, limits choices to the smallest set of tokens whose combined probability mass reaches p. Instead of considering every possible next token, the model samples only from the “nucleus” of plausible options.

Top-p value | What it does
0.1 | Very narrow set of candidates (conservative)
0.5 | Moderate candidate pool
0.9 | Broader pool, more variation
1.0 | No nucleus filter; full distribution after other rules

Top-p is often more adaptive than top-k because the candidate set size changes with context. When one token is overwhelmingly likely, top-p may effectively behave like greedy decoding. When probabilities are spread out, top-p allows more alternatives.
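The adaptive behavior described above is easy to see in a small sketch. The function below is an illustrative implementation of the nucleus idea under the definition given in this section; the two probability tables are invented.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose combined probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# One token dominates: the nucleus collapses to a single candidate.
peaked = {"yes": 0.95, "no": 0.04, "maybe": 0.01}
print(top_p_filter(peaked, 0.9))

# Probabilities are spread out: the same p keeps every candidate.
flat = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}
print(top_p_filter(flat, 0.9))
```

The same top-p value yields one candidate in the first case and four in the second, which is what "adaptive" means here.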

PM takeaway

Top-p is a quality filter on randomness. It helps prevent the model from selecting bizarre low-probability tokens while still allowing useful variation. Many production systems use moderate top-p with low temperature.

7. Top-k: A Simpler Filter

Top-k sampling restricts selection to the k highest-probability tokens, then samples from that fixed shortlist. For example, top-k = 40 means only the top 40 candidates are eligible.

Approach | How it filters | PM note
Top-k | Fixed number of candidates | Simple, but less adaptive
Top-p | Dynamic candidate set by probability mass | Usually preferred in modern APIs
Temperature | Reshapes all probabilities | Often used together with top-p

Some providers expose top-k; others emphasize top-p. As a PM, you do not need to master every decoding algorithm. You need to know which controls your stack exposes and what behavior they produce in evals.

In most product discussions, top-p and temperature are the primary levers. Top-k matters when your engineering team or model host explicitly recommends it.
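For comparison with top-p, a top-k filter can be sketched as follows. The function name and the example probabilities are assumptions made for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize them."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

# Invented next-token probabilities.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_filter(probs, 2))
```

The shortlist always has k entries regardless of how confident the model is, which is why the table calls top-k simple but less adaptive.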

8. Deterministic vs Creative Product Modes

Mature AI products often behave like they have modes — even if users never see the underlying parameters.

Product mode | Sampling posture | Example features
Deterministic / precise | Low temperature, low top-p, optional seed | Classification, extraction, routing, JSON output
Balanced assistant | Moderate temperature and top-p | General chat, summarization, email drafting
Creative / exploratory | Higher temperature, broader top-p | Brainstorming, marketing copy, story ideas
Multi-option generation | Moderate to high variation with n > 1 | Regenerate, compare alternatives, A/B drafts

The mistake is using one global setting for the entire product. A single “temperature = 0.7” default may be fine for open chat and harmful for structured workflows.

PM takeaway

Design sampling per feature, not per model. Your support classifier and your campaign copy generator should not share the same defaults by accident.

9. Reliability, Regression, and User Trust

Users trust AI features when outputs feel stable enough to act on. Sampling instability erodes trust quickly — especially in high-stakes workflows.

How sampling affects trust

User observation | Likely sampling issue | Product risk
“It gave a different answer this time” | Non-zero temperature or no seed | Perceived unreliability
“It was correct in staging” | Eval ran once, not multiple times | False confidence
“Regenerate made it worse” | High variation is working as configured | UX mismatch
“The JSON broke overnight” | Sampling + parser fragility | Production incident
“Demo and prod behave differently” | Different defaults across environments | Release risk

Reliability is not only about correctness. It is about predictable behavior under repeated use. A PM should define acceptable variance before launch — not after users complain.

PM takeaway

For operational features, document expected repeatability. If variation is intentional, explain it in the UX: “Regenerate for alternatives,” not silent inconsistency.

10. When Low Temperature Matters

Use low temperature when the product job rewards consistency over novelty.

Use case | Why low temperature
Intent classification | Labels should not drift between runs
Data extraction | Field values must stay stable
Policy Q&A | Users expect the same question to get the same guidance
Structured JSON output | Reduces format and key variation
Routing and triage | Downstream workflows depend on stable labels
Compliance summaries | Language should not change meaning on rerun
Automated eval baselines | Reduces noise when comparing prompt versions

Low temperature reduces randomness. It does not eliminate all failure modes. You still need good prompts, grounding, validation, and evals.

  • Start with temperature 0 for structured workflows.
  • Increase only if outputs feel too repetitive and repetition is a user problem.
  • Pair low temperature with schema validation or post-processing checks.

11. When Higher Temperature Helps

Higher temperature is useful when the product goal is exploration, diversity, or fresh wording — not strict repeatability.

Use case | Why higher temperature
Brainstorming | Users want multiple distinct ideas
Marketing copy variants | Variation helps compare tone and hooks
Creative writing aids | Predictable output feels stale
Workshop facilitation | Diverse prompts spark discussion
Regenerate button | Users expect a meaningfully different alternative
Idea expansion | Exploring adjacent concepts is the value

The product contract matters. If the UI says “Generate alternatives,” variation is a feature. If the UI says “Get the answer,” variation feels like a bug.

PM takeaway

Higher temperature is not free creativity. It increases the chance of off-brand language, unsupported claims, and parser failures. Use it where divergence is desired — and guardrail elsewhere.

12. Using Temperature and Top-p Together

Temperature and top-p are often configured together. Temperature reshapes probabilities; top-p limits which tokens are eligible. Used well, they balance stability and flexibility.

Configuration | Typical result | Good for
Temp 0 + top-p 1.0 | Near-greedy, most stable | Structured tasks, eval baselines
Temp 0.2 + top-p 0.9 | Mostly stable with slight phrasing flexibility | Support replies, summaries
Temp 0.7 + top-p 0.95 | Noticeable variation | General assistant chat
Temp 1.0 + top-p 0.95 | High diversity | Brainstorming, creative drafts
Temp 0 + top-p 0.5 | In practice the same as temp 0 alone: the top token is always inside the nucleus, so top-p adds no further constraint | Hard stability requirements

Do not tune both blindly. Change one variable at a time in evals. Otherwise you cannot explain why behavior improved or regressed.

Practical starting pattern

For production assistants: moderate top-p (0.8–0.95) with low-to-moderate temperature (0.2–0.5), tuned per feature.
For operational extraction/classification: temperature 0 with validation.
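A minimal sketch of how the two controls can combine in one selection step follows. It assumes temperature is applied first and the top-p filter second, which is a common ordering but not one every provider documents; the function name and the scores are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature, top_p, rng):
    """Temperature reshapes the scores, top-p trims the candidates,
    then one token is drawn from what remains."""
    if temperature == 0:
        # Treated as greedy decoding; top-p cannot change the result here.
        return max(logits, key=logits.get)
    peak = max(logits.values())
    weights = {tok: math.exp((score - peak) / temperature)
               for tok, score in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in nucleus]
    probs = [prob for _, prob in nucleus]
    return rng.choices(tokens, weights=probs, k=1)[0]

# Invented raw scores for three routing decisions.
logits = {"approve": 2.0, "review": 1.5, "reject": 0.1}
rng = random.Random(0)
print(sample_next_token(logits, 0, 1.0, rng))
print([sample_next_token(logits, 0.7, 0.95, rng) for _ in range(5)])
```

Keeping both parameters explicit in one function also makes it easier to vary one at a time in evals, as recommended above.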

13. Sampling Across Models and API Versions

Identical sampling settings do not guarantee identical outputs across model families, model sizes, or provider updates. The probability landscape changes when the model changes.

Change | What may happen
New model version | Same prompt, different default behavior
Switch from GPT to Claude to Gemini | Optimal temperature differs by stack
Smaller model replaces larger model | More brittle outputs at same settings
Provider changes decoding defaults | Silent behavior shift
Fine-tuned model deployed | Distribution narrows or shifts

This is why sampling settings should be versioned alongside prompts and eval suites. When you upgrade a model, re-run evals at the intended temperature and top-p — not just once, but multiple times per test case if variation matters.

PM takeaway

Treat sampling config as part of release management. “We kept temperature at 0.2” is not enough if the model underneath changed.

14. Sampling vs Prompting and Fine-Tuning

Sampling controls how the model chooses among plausible continuations. Prompting and fine-tuning shape what the model is likely to say in the first place. These layers solve different problems.

Problem | Sampling change | Prompt / fine-tune change
Output varies too much run-to-run | Lower temperature / top-p | Prompt may help, but may not fix randomness
Output is repetitive and dull | Raise temperature modestly | Prompt examples can also increase diversity
Wrong task behavior | Sampling will not fix | Better instructions or fine-tuning
Wrong facts | Sampling will not fix | RAG, tools, grounding
Inconsistent JSON keys | Lower temperature helps | Prompt schema + validation helps more
Off-brand tone | Slight effect | Prompt or fine-tuning is primary fix

A common PM mistake is trying to solve a behavior problem by tweaking temperature. If the model consistently misunderstands the task, fix the prompt, context, or training — not randomness alone.

PM takeaway

Prompting shapes intent. Fine-tuning shapes learned behavior. Sampling shapes variation. Use the right layer for the failure you actually see.

15. Common Product Scenarios

Here is how sampling choices often map to real product features.

Feature | Recommended posture | Why
Customer support classifier | Temperature 0 | Stable routing labels
Claims field extraction | Temperature 0 + validation | Parser and workflow depend on structure
FAQ assistant | Low temperature | Users expect consistent guidance
Email rewrite assistant | Low to moderate | Some phrasing flexibility, not chaos
Subject line generator | Moderate to high | Variation is the product value
PRD brainstorming copilot | Moderate to high | Diverse ideas help discovery
Code explanation tool | Low | Accuracy and clarity over novelty
Regenerate button in chat | Moderate+ | Users expect a different alternative

Notice that the same product can contain multiple features with different sampling policies. That is normal and desirable.

16. Structured Output and Sampling

Structured output — JSON, tables, fixed field schemas — is especially sensitive to sampling variation. A single different key, trailing comma, or reordered field can break downstream parsers.

Best practice for structured workflows:

  • Use low or zero temperature.
  • Specify schema clearly in the prompt or use provider schema modes when available.
  • Validate output before passing it to business logic.
  • Retry with repair prompt only when validation fails — not on every request.

Example: claim summary JSON

Prompt asks for structured output. At low temperature, the model typically returns parseable JSON:

{
  "claim_id": "CLM-2026-10482",
  "summary": "Hospitalization claim for appendectomy. Discharge summary present.",
  "missing_documents": ["itemized_bill"],
  "recommended_action": "request_missing_docs",
  "confidence": "medium"
}

At higher temperature, the same prompt may produce acceptable content in a broken envelope:

  • Different key names (missingDocs vs missing_documents)
  • Extra commentary outside the JSON block
  • Valid JSON with a different schema shape
  • Markdown fences that parsers do not expect

PM takeaway

For anything consumed by code, treat sampling as part of the contract. Low temperature reduces variation; schema validation catches what sampling cannot.
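The validate-then-repair practice from this section can be sketched as below. The required keys come from the claim summary example above; call_model is a hypothetical function you would supply to wrap your provider, and the repair wording is an assumption, not a recommended prompt.

```python
import json

REQUIRED_KEYS = {"claim_id", "summary", "missing_documents",
                 "recommended_action", "confidence"}

def validate_claim_summary(raw_text):
    """Return the parsed dict if it matches the expected shape, else None."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # commentary, markdown fences, or broken JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None  # renamed, missing, or extra keys
    if not isinstance(data["missing_documents"], list):
        return None
    return data

def get_valid_summary(call_model, prompt, max_repairs=1):
    """Call the model once; send a repair prompt only if validation fails."""
    raw = call_model(prompt)
    for attempt in range(max_repairs + 1):
        parsed = validate_claim_summary(raw)
        if parsed is not None:
            return parsed
        if attempt < max_repairs:
            raw = call_model(prompt + "\nReturn only valid JSON matching the schema.")
    return None  # caller decides how to handle a persistent failure
```

Returning None instead of raising keeps the decision about fallbacks (human review, default routing) in business logic rather than in the parsing helper.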

17. Max Tokens, Stop Sequences, and Seeds

Sampling is not the only generation control PMs should know. These adjacent parameters also shape product behavior.

Parameter | What it does | PM relevance
Max tokens | Limits response length | Prevents runaway cost and overly long outputs
Stop sequences | Ends generation when specific text appears | Useful for section boundaries and tool-call formats
Seed | Fixes randomness source when supported | Improves repeatability for tests and audits
Presence / frequency penalties | Reduces repetition | Can help long-form drafting features
n / best_of | Generate multiple candidates | Powers compare-and-choose UX

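Max tokens is a product decision, not just an engineering safeguard. Keep in mind that the limit truncates output rather than making the model more concise: a response that hits the cap is simply cut off. If users need concise support replies, ask for brevity in the prompt and use max tokens as a backstop against rambling and runaway cost.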

Seeds are valuable for QA and regression testing, but support varies by provider and may not promise identical outputs across infrastructure changes. Use seeds to reduce variance in evals — not as a permanent substitute for validation.

PM takeaway

Think of generation config as a bundle: temperature, top-p, max tokens, stop sequences, and optional seed. Document the bundle per feature.
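One way to document the bundle per feature is a small typed record that can be versioned and logged. The sketch below is an assumption about how a team might structure this; the class name, feature names, and values are illustrative, and parameter names differ between providers.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass(frozen=True)
class GenerationConfig:
    """One feature's generation settings, kept together as a single unit."""
    feature: str
    temperature: float
    top_p: float
    max_tokens: int
    stop: Tuple[str, ...] = ()
    seed: Optional[int] = None  # only meaningful if the provider supports it

    def __post_init__(self):
        if self.temperature < 0:
            raise ValueError("temperature must be >= 0")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")

# Hypothetical per-feature bundles.
CONFIGS = {
    "claims_extraction": GenerationConfig(
        "claims_extraction", temperature=0.0, top_p=1.0, max_tokens=400, seed=7),
    "subject_lines": GenerationConfig(
        "subject_lines", temperature=0.9, top_p=0.95, max_tokens=120),
}

def config_for(feature):
    """Look up a feature's bundle; unknown features raise KeyError
    instead of silently inheriting a global default."""
    return CONFIGS[feature]

# A plain dict is convenient to attach to request logs and eval reports.
log_record = asdict(config_for("claims_extraction"))
print(log_record)
```

Failing loudly on an unknown feature is a deliberate choice here: it prevents a new feature from shipping with whatever default happened to be set for chat.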

18. Recommended Starting Points

Defaults should be intentional. Use this table as a starting point, then tune with evals.

Use case | Temperature | Top-p | Notes
Classification / routing | 0 | 1.0 or provider default | Validate label set strictly
Extraction to JSON | 0 | 1.0 | Schema validation required
Policy / compliance Q&A | 0–0.2 | 0.8–0.95 | Prioritize consistency
Support reply drafting | 0.2–0.4 | 0.85–0.95 | Human review still recommended
General assistant chat | 0.5–0.7 | 0.9–0.95 | Balance helpfulness and stability
Summarization | 0.2–0.5 | 0.85–0.95 | Lower if summaries feed workflows
Marketing copy variants | 0.7–1.0 | 0.9–1.0 | Expect divergence by design
Brainstorming / ideation | 0.8–1.1 | 0.95–1.0 | Guardrail claims and facts separately

These are starting points, not universal laws. Your model, prompt, and eval data should drive the final settings.

19. Designing Multi-Output UX

Sampling variation becomes a feature when the UX is designed for it. Poor multi-output UX feels broken; good multi-output UX feels intentional.

Common patterns

Pattern | What it does | Sampling implication
Regenerate | User asks for another answer | Needs enough variation to feel fresh
n variants upfront | Show 2–3 options immediately | Use n > 1 or multiple sampled calls
Compare view | Side-by-side diff of alternatives | Helpful for copy and strategy tools
Pin / accept answer | User locks one option | Turns variation into choice, not confusion
History of attempts | Prior outputs remain visible | Reduces distrust when answers differ

UX principles for multi-output features:

  1. Label variation as intentional (“Try another version”).
  2. Show what changed — not just a new blob of text.
  3. Let users accept, edit, or reject an output.
  4. Do not regenerate automatically in high-stakes flows.
  5. Log which sampling config produced each variant.

PM takeaway

Regenerate is a product promise. If temperature is 0, regenerate may disappoint. If temperature is too high, regenerate may destabilize trust. Match the control to the promise.

20. Sampling and Evaluation

Evaluating an LLM feature once per test case is often insufficient when sampling is enabled. You need to measure both quality and consistency.

Metric / concept | What it tells you
Pass@1 | Success rate on a single run
Pass@k | Probability that at least one of k runs succeeds
Consistency rate | How often key fields or labels match across runs
Format validity rate | How often output parses against schema
Semantic variance | Whether meaning changes, not just wording
Regression suite | Repeat critical prompts after model or config changes

Example: a classification feature may score 92% pass@1 but only 78% consistency across five runs at temperature 0.4. That gap is a product signal — not a footnote.
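The metrics in the table can be computed from repeated runs with a few lines of code. This sketch uses simple empirical definitions (pass@1 as the share of individual runs that are correct, pass@k as the share of cases with at least one correct run); teams define these differently, and the ticket data is invented.

```python
def pass_at_1(runs_by_case, expected):
    """Share of all individual runs that match the expected label."""
    total = sum(len(runs) for runs in runs_by_case.values())
    correct = sum(run == expected[case]
                  for case, runs in runs_by_case.items() for run in runs)
    return correct / total

def pass_at_k(runs_by_case, expected):
    """Share of cases where at least one of the k runs is correct."""
    hits = sum(any(run == expected[case] for run in runs)
               for case, runs in runs_by_case.items())
    return hits / len(runs_by_case)

def consistency_rate(runs_by_case):
    """Share of cases where every run produced the same label."""
    stable = sum(len(set(runs)) == 1 for runs in runs_by_case.values())
    return stable / len(runs_by_case)

# Invented results: three test tickets, three runs each.
runs = {
    "ticket_1": ["billing", "billing", "billing"],
    "ticket_2": ["billing", "refund", "billing"],
    "ticket_3": ["tech", "tech", "tech"],
}
expected = {"ticket_1": "billing", "ticket_2": "refund", "ticket_3": "login"}

print(pass_at_1(runs, expected), pass_at_k(runs, expected), consistency_rate(runs))
```

In this toy data, ticket_3 is perfectly consistent and always wrong, while ticket_2 is sometimes right but unstable. Tracking the metrics separately is what surfaces that difference.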

Eval practices for PMs

  • Run high-risk prompts multiple times before launch.
  • Track format failures separately from factual errors.
  • Compare configs side by side: temp 0 vs temp 0.3 vs temp 0.7.
  • Include “regenerate” scenarios in eval sets for creative features.
  • Version sampling config in eval reports.

PM takeaway

A feature can be “good on average” and still fail in production because variation is too high. Measure consistency explicitly.

21. Should Users Control Temperature?

Most consumer users do not think in terms of temperature or top-p. Exposing raw sliders creates confusion, support burden, and inconsistent outcomes.

A better pattern is user-facing presets that map to backend sampling configs.

User-facing preset | Backend posture | Best for
Precise | Temperature 0–0.2, tighter top-p | Factual answers, structured tasks
Balanced | Temperature 0.4–0.6, moderate top-p | Everyday assistant use
Creative | Temperature 0.8+, broader top-p | Brainstorming, drafting alternatives

Advanced controls can exist in developer settings or admin panels — not in the default consumer path.

PM takeaway

Translate knobs into outcomes. Users choose “More creative,” not “temperature 0.9.” Your team owns the mapping and validates it with evals.
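The preset mapping can be as small as a lookup table owned by the product team. In this sketch the temperatures fall inside the ranges from the table above, while the specific top-p numbers are assumptions, since the table only says tighter, moderate, and broader.

```python
# Hypothetical mapping from user-facing presets to backend settings.
PRESETS = {
    "Precise": {"temperature": 0.1, "top_p": 0.8},
    "Balanced": {"temperature": 0.5, "top_p": 0.9},
    "Creative": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_for_preset(name):
    """Translate a preset name into settings; unknown names fall back
    to Balanced so the UI never sends raw numbers to the backend."""
    return dict(PRESETS.get(name, PRESETS["Balanced"]))

print(sampling_for_preset("Creative"))
```

Because users only ever send a preset name, the team can retune the numbers behind each preset after an eval run without changing the UI.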

22. PM Decision Framework

Use this framework when choosing sampling settings for a feature.

Step 1: Define the product job

Question | If yes | If no
Must output be repeatable? | Favor temperature 0 | Variation may be acceptable
Does code parse the output? | Low temperature + validation | More flexibility allowed
Do users want alternatives? | Design multi-output UX | Optimize for single best answer
Is wording novelty valuable? | Moderate to high temperature | Keep temperature low
Is factual stability critical? | Low temperature + grounding | Creativity can be prioritized

Step 2: Choose the minimum sufficient randomness

Failure observed | First lever
Same answer feels stale | Slightly raise temperature or add regenerate
Answers change too much | Lower temperature and/or top-p
JSON breaks intermittently | Temperature 0 + schema validation
Regenerate feels identical | Increase variation modestly
Model ignores instructions | Fix the prompt first, not temperature
Facts are wrong | RAG/tools first, not temperature

Step 3: Validate with repeated runs

Before shipping, run representative prompts multiple times. If unacceptable divergence appears, tighten sampling or add validation — do not rely on “usually works.”

23. What PMs Should Not Do

Avoid these common sampling mistakes.

Mistake | Why it is bad
Using one global temperature for the whole product | Different features need different stability
Raising temperature to fix factual errors | Randomness does not add truth
Testing a prompt only once before launch | Hides sampling variance
Exposing raw temperature sliders to all users | Creates confusion and support load
Assuming temperature 0 means “production safe” | Correctness and grounding still matter
Ignoring config drift across environments | Staging and prod behave differently
Shipping structured output without validation | Intermittent parser failures become incidents
Changing temperature and top-p together during debugging | You cannot tell what helped
Not logging generation settings | Impossible to debug user reports

Sampling is a powerful control. Used carelessly, it creates the illusion of progress while leaving the real product problem untouched.

24. Final Mental Model

The model predicts probabilities.
Sampling turns probabilities into picks.
Temperature controls how adventurous the picks are.
Top-p controls how many options are allowed.
Prompting shapes what is likely.
Validation ensures the product can trust the result.

If you remember only one thing from this chapter, remember this:

Match sampling to the job: low randomness for reliability, higher randomness for exploration — and never confuse variation with understanding.

That is the practical PM answer.

Chapter Summary

Concept | PM understanding
Sampling | How the model chooses the next token from probabilities
Why outputs vary | Randomness, settings, model version, and context differences
Greedy decoding | Always pick the top token; the most deterministic common mode
Temperature | Creativity knob: lower is stable, higher is varied
Top-p | Limits choices to a nucleus of plausible tokens
Top-k | Fixed shortlist filter; simpler, less adaptive than top-p
Deterministic vs creative modes | Product features need different defaults
Structured output | Low temperature plus schema validation
Seeds and max tokens | Part of the full generation config bundle
Multi-output UX | Regenerate and variants should be intentional, not accidental
Evals | Measure pass@k and consistency, not just one-run accuracy
User controls | Presets like Precise / Balanced / Creative beat raw sliders
PM role | Choose the minimum sufficient randomness for each feature

Closing Thought

Users ask why the AI “changed its mind.” Often the answer is not that the model learned something new overnight. It is that the product allowed a different path through the same probability landscape.

That is not inherently bad. Creativity, brainstorming, and regenerate flows depend on exactly this behavior. But operational AI — routing, extraction, compliance, structured workflows — depends on the opposite.

Product managers do not need to implement sampling algorithms. They do need to decide which features deserve stability, which deserve exploration, and how the UX makes that contract clear.

Get that decision right, and temperature stops being a mysterious engineering knob. It becomes a product design choice — one you can explain, evaluate, and defend.

The next chapter in this module compares pre-training, fine-tuning, and RLHF — how each stage shapes what a model knows and how it behaves before sampling ever enters the picture.

The real PM lesson

Control randomness on purpose. Accidental variation is a bug; intentional variation is a feature.
