Chapter 05 · Module 02 · Beginner–Intermediate · 28–32 min

Chapter 5: Cost per Token, Latency, and Model Tier Tradeoffs — The PM Version

Model LLM spend like infrastructure—tokens, tiers, routing, caching, and latency—with formulas you can defend in a business review.

Book: AI Learning
[Figure: Volume, Tokens, Tier, Monthly bill]

Every feature has a unit economics story—price it before you ship it

Introduction

In Module 01, Chapter 3, we covered tokens and context windows—the working memory of an LLM call.

In Chapter 4, we covered benchmarks and product-specific evals—how to compare models on your tasks, not leaderboard hype.

This chapter connects quality decisions to money and speed. You will not find provider price tables here. Prices change; contracts differ; cached and batch rates have their own rules. What stays useful is the structure of the bill.

Monthly Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price) + (Cached Tokens × Cached Price) + Tool/Infra

Plug in current list prices from your vendor’s pricing page when you build a memo. The PM job is to own the variables—volume, token mix, tier, routing, cache hit rate—not to memorize yesterday’s cents per million.

PM takeaway

Cost is a product constraint, not a finance surprise after launch. Model it in the PRD the same way you model API rate limits.

1. Why Cost Matters for PMs

Traditional SaaS features have near-zero marginal cost per extra click. LLM features do not. Each successful completion burns input tokens, output tokens, and often retrieval, tools, and human review on top.

Cost shows up in four places PMs care about:

  • Unit economics — Can this feature survive at expected usage and price?
  • Roadmap tradeoffs — A “smarter” default model may erase margin on a high-volume workflow.
  • Reliability budgets — Agent loops and long outputs multiply spend per user action.
  • Procurement conversations — You need a defendable model, not “we picked the famous one.”

| Stakeholder question | What you answer with |
|---|---|
| “What if usage 10×?” | Scaled monthly formula + tier/routing plan |
| “Why is support AI expensive?” | Output length + RAG input + tool round-trips |
| “Can we use the frontier model everywhere?” | Cost per successful task by tier |
| “Is caching worth engineering?” | Break-even on cache hit rate vs build time |

PM takeaway

If you cannot sketch monthly cost at 10k and 1M calls, you are not ready to commit to a default model in production.

2. Token Types and What Gets Billed

Providers bill by token category. Names differ by vendor; the PM concepts are stable.

| Category | Typical contents | PM note |
|---|---|---|
| Input | System prompt, user message, RAG chunks, tool results fed back in | Often the largest line item for grounded apps |
| Output | Model completion text (and sometimes “reasoning” tokens) | Usually priced higher per token than input |
| Cached input | Prefix repeated across calls (system prompt, doc index) | Separate rate Pcache; hits reduce average cost |
| Tool / infra | Embeddings, rerankers, speech, vision units, vector DB, egress | Not always in the LLM invoice, but still COGS |

Symbolic unit prices (per million tokens or per your vendor’s unit):

| Symbol | Meaning |
|---|---|
| Pin | Price per input token (or per 1M input tokens) |
| Pout | Price per output token |
| Pcache | Price per cached input token (read from cache) |
| Ctool | Fixed + variable tool/infra cost per call |

Multimodal and long-context models may bill images, audio seconds, or context tiers separately. When you spec a feature, list every metered dimension, not only “GPT call.”

3. Why Output Tokens Matter So Much

Teams optimize prompts for input size—and forget that verbose outputs dominate many bills. Pout is often several times Pin on the same model family.

High-output patterns PMs should flag in reviews:

  • “Explain your reasoning in detail” on every ticket
  • JSON wrapped in long natural-language preambles
  • Agent scratchpads kept in the user-visible thread
  • Unbounded “summarize this 40-page PDF” without length caps
  • Multiple draft variants per request (A/B at full generation cost)

| Design lever | Effect on bill |
|---|---|
| Max output tokens | Hard cap on worst case per call |
| Structured output (schema) | Shorter, parseable completions |
| Two-step: classify then generate | Small model for routing; large only when needed |
| Streaming + early stop | Stop when confidence threshold met (with care) |

Watch the denominator

Finance will ask cost per successful outcome. A cheap model with 40% rework can cost more than a pricier model with 90% first-pass success.
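
To make the denominator concrete, here is a minimal Python sketch of cost per successful outcome. It assumes a simple model that the text does not spell out: every task gets one model call, and each first-pass failure triggers one rework step with a fixed cost. The call costs and the rework cost are made-up placeholders, not vendor prices; the 60% and 90% first-pass rates mirror the example above.

```python
def cost_per_successful_task(call_cost: float,
                             first_pass_success: float,
                             rework_cost: float) -> float:
    """Expected cost to finish one task.

    Assumption: one model call per task, plus one rework step
    (human or automated) for every first-pass failure.
    """
    if not 0.0 <= first_pass_success <= 1.0:
        raise ValueError("first_pass_success must be between 0 and 1")
    failure_rate = 1.0 - first_pass_success
    return call_cost + failure_rate * rework_cost


# Hypothetical numbers for illustration only.
REWORK_COST = 0.50  # cost of fixing one bad answer
cheap = cost_per_successful_task(0.002, 0.60, REWORK_COST)   # 40% rework
pricey = cost_per_successful_task(0.020, 0.90, REWORK_COST)  # 10% rework

print(f"cheap model:  {cheap:.3f} per finished task")
print(f"pricey model: {pricey:.3f} per finished task")
```

With these placeholder inputs the model that is ten times cheaper per call ends up roughly three times more expensive per finished task, because rework dominates. The conclusion flips if rework is nearly free, which is why the rework cost belongs in the memo as an explicit assumption.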

4. Unit Economics per Call

Define one representative inference call for your workflow:

  • Tin = input tokens per call
  • Tout = output tokens per call
  • Tcache = cached input tokens billed at cache rate (often a fraction of Tin)
  • Tin,uncached = Tin − Tcache (when cache accounting splits uncached vs cached)

Cost per call (symbolic):

Ccall = (Tin,uncached × Pin) + (Tcache × Pcache) + (Tout × Pout) + Ctool

Illustrative single-call math (plug your vendor prices)
| Variable | Example value | Line cost |
|---|---|---|
| Tin,uncached | 3,000 | 3,000 × Pin |
| Tcache | 2,000 | 2,000 × Pcache |
| Tout | 500 | 500 × Pout |
| Ctool | embedding + rerank | fixed per call |
| Total | | Ccall |

Measure Tin and Tout from staging logs (tokenizer or provider usage fields). Guessing “about 2k tokens” in an exec deck is how features get underpriced.
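
The single-call formula can be sketched in a few lines of Python. The token counts are the illustrative ones from the table (3,000 uncached input, 2,000 cached, 500 output); the prices and the tool cost are hypothetical placeholders chosen for round arithmetic, and the `Prices` class and `cost_per_call` function are names invented for this sketch. Prices are assumed to be quoted per 1M tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Unit prices per 1M tokens. Values passed in are placeholders."""
    p_in: float     # Pin
    p_out: float    # Pout
    p_cache: float  # Pcache (cache read)


def cost_per_call(t_in_uncached: int, t_cache: int, t_out: int,
                  prices: Prices, c_tool: float = 0.0) -> float:
    """Ccall = Tin,uncached*Pin + Tcache*Pcache + Tout*Pout + Ctool."""
    per_token = 1 / 1_000_000  # prices are quoted per 1M tokens
    return (t_in_uncached * prices.p_in * per_token
            + t_cache * prices.p_cache * per_token
            + t_out * prices.p_out * per_token
            + c_tool)


# Hypothetical prices, not taken from any vendor's price list.
example_prices = Prices(p_in=2.00, p_out=8.00, p_cache=0.50)
c_call = cost_per_call(t_in_uncached=3_000, t_cache=2_000, t_out=500,
                       prices=example_prices, c_tool=0.002)
print(f"Ccall = {c_call:.4f}")
```

Keeping prices in one small object makes it easy to swap in dated values from a contract without touching the formula.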

5. Monthly Cost Formula

Let N = successful API calls per month (or per business unit). Let overhead Cfixed cover dashboards, support contracts, reserved capacity, and fine-tune hosting.

Monthly Cost = N × Ccall + Cfixed

Expanded:

Monthly Cost = N × [(Tin,uncached × Pin) + (Tcache × Pcache) + (Tout × Pout) + Ctool] + Cfixed

For multi-step agents, sum Ccall across steps or use Crun per user task × tasks per month.
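
As a rough sketch of the monthly formula and the agent variant, the Python below computes N × Ccall + Cfixed for a single-step feature and Crun × tasks + Cfixed for a three-step agent. Every number is a hypothetical placeholder, and the function names are invented for this example.

```python
def monthly_cost(n_calls: int, cost_per_call: float, fixed: float = 0.0) -> float:
    """Monthly Cost = N * Ccall + Cfixed."""
    return n_calls * cost_per_call + fixed


def cost_per_run(step_costs: list[float]) -> float:
    """Crun for a multi-step agent: the sum of Ccall over its steps."""
    return sum(step_costs)


# Hypothetical inputs for illustration only.
C_CALL = 0.013   # cost of one single-step call
C_FIXED = 500.0  # monthly overhead (dashboards, contracts, hosting)

single_step = monthly_cost(100_000, C_CALL, C_FIXED)

# An agent task with three LLM steps: a cheap routing step and two full calls.
c_run = cost_per_run([0.004, 0.013, 0.013])
agent = monthly_cost(20_000, c_run, C_FIXED)

print(f"single-step feature: {single_step:,.2f} per month")
print(f"agent feature:       {agent:,.2f} per month")
```

Note that the agent is counted in user tasks, not API calls: 20,000 tasks here mean 60,000 billed calls.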

PM takeaway

Publish three scenarios—pilot, target, stress—each with its own N and token assumptions. Leadership remembers the stress number.

6. Scaling Scenarios: 10k, 100k, and 1M Calls

Use the same Ccall; only N changes. This table is structural—replace symbols with priced values from your contract.

| Monthly calls (N) | Inference subtotal | Monthly Cost | PM question to ask |
|---|---|---|---|
| 10,000 | 10,000 × Ccall | 10,000 × Ccall + Cfixed | Does pilot budget match one team’s workflow? |
| 100,000 | 100,000 × Ccall | 100,000 × Ccall + Cfixed | Do we need routing before wide rollout? |
| 1,000,000 | 1,000,000 × Ccall | 1,000,000 × Ccall + Cfixed | Does gross margin hold at list price × conversion? |

Sensitivity knobs (rank these for your feature):

  1. ±20% on Tout (verbosity policy)
  2. ±30% on Tin (RAG chunk count)
  3. Cache hit rate on system prompt (0% vs 70% hit)
  4. Share of calls routed to mid-tier vs frontier
  5. Agent steps per task (1 vs 4 LLM calls)

| Scenario | Ccall vs baseline | At N = 1M |
|---|---|---|
| Baseline | 1.0× | 1.0M × Ccall |
| Verbose outputs (+25% Tout) | ~1.15–1.25× | Proportional increase |
| 70% cache hit on 40% of input | Depends on Pcache | Recompute with split input |
| 50% calls on mid-tier (½ Pin, ½ Pout) | Blended rate | Weighted average tiers |
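
The scenario rows can be recomputed with a short Python sketch. The baseline token counts and prices are hypothetical placeholders, tool cost is left out for simplicity, and the cache row assumes that the expected share of input billed at the cache rate is cacheable share × hit rate. The size of the verbosity effect depends on how much of Ccall is output, so different token mixes will give different multipliers.

```python
M = 1_000_000  # prices below are quoted per 1M tokens
N = 1_000_000  # monthly calls in the stress scenario


def call_cost(t_in, t_out, p_in, p_out, t_cache=0, p_cache=0.0):
    """Token cost of one call; t_cache tokens bill at p_cache instead of p_in."""
    return ((t_in - t_cache) * p_in + t_cache * p_cache + t_out * p_out) / M


# Hypothetical baseline: token counts and prices are placeholders.
T_IN, T_OUT = 2_000, 1_000
P_IN, P_OUT, P_CACHE = 2.00, 8.00, 0.50

baseline = call_cost(T_IN, T_OUT, P_IN, P_OUT)
mid_tier = call_cost(T_IN, T_OUT, P_IN / 2, P_OUT / 2)

scenarios = {
    "baseline": baseline,
    # Verbosity policy slips: 25% more output tokens.
    "verbose (+25% Tout)": call_cost(T_IN, T_OUT * 1.25, P_IN, P_OUT),
    # 40% of input is a cacheable prefix and 70% of calls hit the cache,
    # so on average 0.40 * 0.70 of input tokens bill at the cache rate.
    "70% hit on 40% of input": call_cost(
        T_IN, T_OUT, P_IN, P_OUT,
        t_cache=T_IN * 0.40 * 0.70, p_cache=P_CACHE),
    # Half the calls move to a mid tier priced at half Pin and half Pout.
    "50% on mid-tier": 0.5 * baseline + 0.5 * mid_tier,
}

for name, cost in scenarios.items():
    print(f"{name:26s} {cost / baseline:4.2f}x   at N=1M: {cost * N:>9,.0f}")
```

With these placeholder inputs, verbosity raises Ccall by about 17%, the cache scenario trims about 7%, and the blended tier mix cuts 25%, which is one way to rank the sensitivity knobs for a given feature.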

7. Model Tiers: Frontier, Mid, and Small

Tiers are product choices, not vanity labels. Map tiers to workflows—not to “we always use the best.”

| Tier | Typical strengths | Typical PM use | Cost / latency |
|---|---|---|---|
| Frontier | Hard reasoning, long context, multimodal | Escalation, complex drafts, high-stakes review assist | Highest Pin/Pout; often slower TTFT |
| Mid | Strong general quality at lower price | Default copilot, Q&A with RAG, most chat | Balanced unit economics |
| Small | Fast classification, extraction, routing | Intent detect, JSON slot-fill, guardrail checks | Lowest per token; may need more repair loops |

Relative pricing is usually order-of-magnitude, not 5%: small may be 10–30× cheaper per token than frontier on the same provider stack. Exact ratios belong in your memo with dated P values.

PM takeaway

Default to mid tier for volume paths; reserve frontier for steps that fail evals on mid—or for explicit user “deep analysis” opt-in.

8. Routing and Cascade Architectures

Routing sends each request to a model tier based on rules or a cheap classifier. Cascades try a cheap path first, then escalate only when confidence is low or eval checks fail.

| Pattern | How it saves money | PM risk |
|---|---|---|
| Intent router → small / mid / large | Most traffic avoids frontier | Mis-routing on ambiguous queries |
| Cascade: small answer → judge → escalate | Pay frontier only on hard subset | Latency stacks on escalations |
| Tool-first: DB lookup before LLM | Zero tokens when lookup succeeds | Product feels “dumb” if routing too aggressive |
| Human queue for low confidence | Avoid repeated auto-retries | Ops cost moves, does not disappear |

Model routing in the PRD should specify: triggers for escalation, max steps per user task, and eval gates that block silent downgrade of quality.
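
A back-of-envelope Python sketch of the cascade pattern (small answer, then judge, then escalate) shows why the escalation rate is the number to watch. It assumes every request pays for the small model and the judge, and only escalated requests also pay for the frontier model. The per-call costs are hypothetical placeholders, and the function names are invented for this example.

```python
def cascade_cost(c_small: float, c_judge: float, c_frontier: float,
                 escalation_rate: float) -> float:
    """Expected cost per request for: small answer -> judge -> escalate.

    Every request pays for the small model and the judge; only the
    escalated share also pays for the frontier model.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return c_small + c_judge + escalation_rate * c_frontier


def break_even_escalation_rate(c_small: float, c_judge: float,
                               c_frontier: float) -> float:
    """Escalation rate above which the cascade costs more than frontier-only."""
    return (c_frontier - c_small - c_judge) / c_frontier


# Hypothetical per-call costs for illustration only.
C_SMALL, C_JUDGE, C_FRONTIER = 0.001, 0.0005, 0.020

expected = cascade_cost(C_SMALL, C_JUDGE, C_FRONTIER, escalation_rate=0.20)
threshold = break_even_escalation_rate(C_SMALL, C_JUDGE, C_FRONTIER)

print(f"frontier only: {C_FRONTIER:.4f} per request")
print(f"cascade:       {expected:.4f} per request")
print(f"cascade stops paying off above {threshold:.1%} escalations")
```

The sketch covers cost only. Escalated requests also wait for all three steps, so the latency budget for the escalated path needs its own line in the PRD.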

9. Caching Economics

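Prompt caching (a provider feature) discounts repeated long prefixes such as system prompts, tool definitions, and stable policy docs. Depending on the provider, you may pay a cache write cost when a prefix is first stored (and again after it expires), then the reduced Pcache rate on hits.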

| Concept | PM implication |
|---|---|
| Cacheable prefix | Put stable content first; volatile user text last |
| Hit rate | High-volume features with shared system prompt benefit most |
| TTL / invalidation | Policy change must bust cache or you serve stale instructions |
| Application cache | Store embeddings and retrieval results, not only the LLM prefix cache |

Break-even sketch: if cache engineering takes E engineer-weeks, compare the monthly savings, N × Tprefix × hit_rate × (Pin − Pcache), against the cost of those E engineer-weeks.
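
The break-even comparison can be sketched in Python as a payback period. The sketch assumes prices quoted per 1M tokens, ignores any cache write surcharge, and converts engineer-weeks to money with a loaded weekly cost. All inputs, including that weekly cost, are hypothetical placeholders.

```python
def monthly_cache_savings(n_calls: int, prefix_tokens: int, hit_rate: float,
                          p_in: float, p_cache: float) -> float:
    """N * Tprefix * hit_rate * (Pin - Pcache), with prices per 1M tokens.

    Assumption: cache write surcharges are ignored.
    """
    return n_calls * prefix_tokens * hit_rate * (p_in - p_cache) / 1_000_000


def payback_months(engineer_weeks: float, cost_per_engineer_week: float,
                   savings_per_month: float) -> float:
    """Months of savings needed to repay the one-off build cost."""
    if savings_per_month <= 0:
        return float("inf")  # caching never pays for itself
    return engineer_weeks * cost_per_engineer_week / savings_per_month


# Hypothetical inputs for illustration only.
savings = monthly_cache_savings(n_calls=1_000_000, prefix_tokens=2_000,
                                hit_rate=0.70, p_in=2.00, p_cache=0.50)
months = payback_months(engineer_weeks=2, cost_per_engineer_week=4_000,
                        savings_per_month=savings)

print(f"monthly savings: {savings:,.0f}")
print(f"payback period:  {months:.1f} months")
```

Rerunning the same function at pilot volume usually gives a payback measured in years, which is a reasonable argument for deferring cache work until the target scenario is in sight.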

PM takeaway

Caching is a product architecture decision—stable system prompt design—not only an infra ticket.

10. Latency vs Cost

Users feel time to first token (TTFT) and total time to last token. Finance feels the monthly token bill. The two often pull in opposite directions, so trade them off explicitly per workflow.

| UX need | Typical lever | Cost note |
|---|---|---|
| Chat feels instant | Streaming, smaller model, shorter context | Smaller model may need second call |
| Batch overnight jobs | Batch API, lower priority queue | Lower P per token; not for interactive UX |
| Complex analysis | Frontier + long output | Accept higher Ccall and p95 latency |
| Global users | Regional endpoints | May affect data residency and price list |

Spec latency budgets per workflow tier: e.g. “routing < 300ms,” “user-visible draft starts < 2s TTFT,” “full report < 45s.” Pair each budget with an allowed tier and max tokens.
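
One way to make such budgets checkable is to write them down as data. The Python sketch below uses the three latency limits quoted above; the tier pairings, the token caps, and the `WorkflowBudget` and `violations` names are assumptions made for illustration, not a recommended configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowBudget:
    name: str
    max_latency_ms: int     # latency limit for the metric below
    metric: str             # "ttft" or "total"
    allowed_tier: str       # assumption: tier pairing is illustrative
    max_output_tokens: int  # assumption: caps are placeholders


BUDGETS = [
    WorkflowBudget("routing", 300, "total", "small", 50),
    WorkflowBudget("user-visible draft", 2_000, "ttft", "mid", 800),
    WorkflowBudget("full report", 45_000, "total", "frontier", 4_000),
]


def violations(budget: WorkflowBudget, observed_ms: float,
               tier: str, output_tokens: int) -> list[str]:
    """Return a list of human-readable budget breaches (empty if none)."""
    problems = []
    if observed_ms >= budget.max_latency_ms:
        problems.append(
            f"{budget.metric} latency {observed_ms:.0f}ms "
            f"is not under {budget.max_latency_ms}ms")
    if tier != budget.allowed_tier:
        problems.append(f"tier '{tier}' used, '{budget.allowed_tier}' allowed")
    if output_tokens > budget.max_output_tokens:
        problems.append(
            f"{output_tokens} output tokens exceeds cap "
            f"{budget.max_output_tokens}")
    return problems


# Example: a draft that was slow, on the wrong tier, and too long.
for problem in violations(BUDGETS[1], observed_ms=2_500,
                          tier="frontier", output_tokens=900):
    print(f"{BUDGETS[1].name}: {problem}")
```

Feeding p95 measurements from staging into a check like this turns the latency section of the PRD into something that can fail a release review.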

11. Cost Risks and Failure Modes

  • Retry storms — Failed tool calls that re-invoke the LLM with full context each time.
  • Context creep — Chat history grows unbounded; every turn re-sends the thread.
  • Agent loops — “Try again” until max steps without a stopping rule.
  • Free-tier abuse — Power users on unlimited plans burning frontier models.
  • Eval gap — Cheap model ships; human rework costs more than inference saved.
  • Price list change — No alert when vendor updates Pin/Pout.
  • Shadow AI — Teams paste production data into consumer chat tools (different risk, still cost and compliance).

| Control | What it limits |
|---|---|
| Per-user / per-tenant token budgets | Runaway sessions |
| Max steps per agent task | Loop spend |
| Cost dashboards by feature flag | Surprises at 1M scale |
| Canary on price change | Margin erosion |
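
Context creep is easy to underestimate, so here is a small Python sketch of how input tokens accumulate when every turn re-sends the thread, with and without a cap on history. The sliding-window cap is one possible control, chosen here for illustration. The token counts are hypothetical placeholders, and the function name is invented for this example.

```python
from typing import Optional


def conversation_input_tokens(turns: int, system_tokens: int,
                              user_tokens: int, reply_tokens: int,
                              max_history_turns: Optional[int] = None) -> int:
    """Total input tokens billed across a chat when each turn re-sends history.

    Turn k sends: system prompt + earlier turns (user + reply) + the new
    user message. `max_history_turns` keeps only the most recent turns.
    """
    total = 0
    for k in range(1, turns + 1):
        history_turns = k - 1
        if max_history_turns is not None:
            history_turns = min(history_turns, max_history_turns)
        total += (system_tokens
                  + history_turns * (user_tokens + reply_tokens)
                  + user_tokens)
    return total


# Hypothetical token counts for illustration only.
unbounded = conversation_input_tokens(20, 500, 100, 300)
windowed = conversation_input_tokens(20, 500, 100, 300, max_history_turns=4)

print(f"unbounded history: {unbounded:,} input tokens over 20 turns")
print(f"last 4 turns only: {windowed:,} input tokens over 20 turns")
```

Without a cap, input tokens grow roughly with the square of the number of turns. With the window, growth is linear, at the price of the model forgetting older turns, which is a quality tradeoff to check against evals.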

12. Hands-On Exercise

Pick one real or hypothetical AI feature (e.g. internal policy Q&A, code review assist, claims note draft). Complete:

  1. Log or estimate Tin, Tout, and tool costs for one happy-path call.
  2. Look up current Pin, Pout, Pcache on your vendor site—date-stamp them in a footnote.
  3. Compute Ccall and monthly cost at N = 10k, 100k, 1M.
  4. Propose one routing rule and one output cap that cut cost ≥20% with acceptable eval risk.
  5. Write one sentence for finance: “At 100k calls/month we spend approximately ___ assuming ___.”

Done when

You can explain the biggest cost driver in plain language (usually output length, RAG input, or agent steps—not “the model brand”).

13. Decision Framework Matrix

Use this when choosing default tier and architecture for a new workflow:

| Dimension | Lean cheap | Balanced | Lean quality |
|---|---|---|---|
| Task difficulty | Classification, extraction | RAG Q&A, drafting with review | Multi-step reasoning, rare edge cases |
| Default tier | Small | Mid | Frontier (or cascade into it) |
| Context strategy | Minimal prompt | RAG top-k tuned | Long context + tools |
| Output policy | Strict max tokens + schema | Templates + citations | Longer drafts, human edit |
| Latency target | < 1s TTFT | 2–5s acceptable | Quality over speed |
| When margin fails | Batch, cache, cut features | Route + compress prompts | Charge premium tier or limit volume |

14. Common PM Mistakes

| Mistake | Why it hurts |
|---|---|
| Quoting list price without token counts | “$3 per million” is meaningless alone |
| Assuming input is the whole bill | Output and tools often dominate |
| Frontier as default for all traffic | Margin collapses at scale |
| Ignoring failed calls | Retries double spend |
| No observability by feature | Cannot attribute overrun |
| Benchmark cost on demo prompts only | Production prompts are 3–10× larger |
| Hardcoding prices in code/docs | Stale numbers mislead leadership |

Chapter Summary

| Concept | PM understanding |
|---|---|
| Formula | Monthly = N × (input + output + cache + tools) + fixed |
| Symbols | Pin, Pout, Pcache from vendor; refresh often |
| Output tokens | Often the silent margin killer |
| Scale | Model 10k / 100k / 1M with same Ccall |
| Tiers | Small / mid / frontier mapped to workflows |
| Routing & cache | Architecture levers, not afterthoughts |
| Latency | Trade speed, quality, and Ccall explicitly |
| Next step | Chapter 6: model selection memo |

Closing Thought

The PM who wins budget and trust shows the spreadsheet: tokens per call, price symbols dated this quarter, three volume scenarios, and a plan when the stress case happens.

Model choice without economics is branding. Economics without evals is false savings. You need both—which is why the next chapter is the capstone memo that ties benchmarks, cost, and risk into one decision document.

The real PM lesson

Own the formula. Refresh the prices. Defend the tier mix with evals and unit economics together.
