Cost per Token, Latency, and Model Tier Tradeoffs — The PM Version

Introduction

In Module 01, Chapter 3, we covered tokens and context windows—the working memory of an LLM call.

In Chapter 4, we covered benchmarks and product-specific evals—how to compare models on your tasks, not leaderboard hype.

This chapter connects quality decisions to money and speed. You will not find provider price tables here. Prices change; contracts differ; cached and batch rates have their own rules. What stays useful is the structure of the bill.

Monthly Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price) + (Cached Tokens × Cached Price) + Tool/Infra

Plug in current list prices from your vendor’s pricing page when you build a memo. The PM job is to own the variables—volume, token mix, tier, routing, cache hit rate—not to memorize yesterday’s cents per million.

PM takeaway

Cost is a product constraint, not a finance surprise after launch. Model it in the PRD the same way you model API rate limits.

1. Why Cost Matters for PMs

Traditional SaaS features have near-zero marginal cost per extra click. LLM features do not. Each successful completion burns input tokens, output tokens, and often retrieval, tools, and human review on top.

Cost shows up in four places PMs care about:

Unit economics — Can this feature survive at expected usage and price?
Roadmap tradeoffs — A “smarter” default model may erase margin on a high-volume workflow.
Reliability budgets — Agent loops and long outputs multiply spend per user action.
Procurement conversations — You need a defendable model, not “we picked the famous one.”

Stakeholder question	What you answer with
“What if usage grows 10×?”	Scaled monthly formula + tier/routing plan
“Why is support AI expensive?”	Output length + RAG input + tool round-trips
“Can we use the frontier model everywhere?”	Cost per successful task by tier
“Is caching worth engineering?”	Break-even on cache hit rate vs build time

PM takeaway

If you cannot sketch monthly cost at 10k and 1M calls, you are not ready to commit to a default model in production.

2. Token Types and What Gets Billed

Providers bill by token category. Names differ by vendor; the PM concepts are stable.

Category	Typical contents	PM note
Input	System prompt, user message, RAG chunks, tool results fed back in	Often the largest line item for grounded apps
Output	Model completion text (and sometimes “reasoning” tokens)	Usually priced higher per token than input
Cached input	Prefix repeated across calls (system prompt, doc index)	Separate rate P_cache; hits reduce average cost
Tool / infra	Embeddings, rerankers, speech, vision units, vector DB, egress	Not always in the LLM invoice—still COGS

Symbolic unit prices (per million tokens or per your vendor’s unit):

Symbol	Meaning
P_in	Price per input token (or per 1M input tokens)
P_out	Price per output token
P_cache	Price per cached input token (read from cache)
C_tool	Fixed + variable tool/infra cost per call

Multimodal and long-context models may bill images, audio seconds, or context tiers separately. When you spec a feature, list every metered dimension, not only “GPT call.”

3. Why Output Tokens Matter So Much

Teams optimize prompts for input size—and forget that verbose outputs dominate many bills. P_out is often several times P_in on the same model family.

High-output patterns PMs should flag in reviews:

“Explain your reasoning in detail” on every ticket
JSON wrapped in long natural-language preambles
Agent scratchpads kept in the user-visible thread
Unbounded “summarize this 40-page PDF” without length caps
Multiple draft variants per request (A/B at full generation cost)

Design lever	Effect on bill
Max output tokens	Hard cap on worst case per call
Structured output (schema)	Shorter, parseable completions
Two-step: classify then generate	Small model for routing; large only when needed
Streaming + early stop	Stop when confidence threshold met (with care)

Watch the denominator

Finance will ask cost per successful outcome. A cheap model with 40% rework can cost more than a pricier model with 90% first-pass success.
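The denominator point can be made concrete with a minimal sketch. It assumes every call that fails first pass is fixed exactly once at a flat rework cost; the function name and all dollar figures are hypothetical, not vendor prices.

```python
def cost_per_success(c_call: float, first_pass_rate: float, rework_cost: float) -> float:
    """Expected cost of one accepted outcome.

    Assumes each call that fails first pass is fixed once (by a human or a
    second pass) at a flat rework_cost. Real rework may loop more than once.
    """
    if not 0.0 <= first_pass_rate <= 1.0:
        raise ValueError("first_pass_rate must be between 0 and 1")
    return c_call + (1.0 - first_pass_rate) * rework_cost


# Hypothetical numbers: dollars per call and dollars per rework.
cheap = cost_per_success(c_call=0.002, first_pass_rate=0.60, rework_cost=0.50)
pricey = cost_per_success(c_call=0.020, first_pass_rate=0.90, rework_cost=0.50)

print(f"cheap model:  ${cheap:.3f} per successful outcome")
print(f"pricey model: ${pricey:.3f} per successful outcome")
```

With these made-up inputs the model that is 10× cheaper per call ends up roughly 3× more expensive per accepted outcome, because rework dominates inference.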

4. Unit Economics per Call

Define one representative inference call for your workflow:

T_in = input tokens per call
T_out = output tokens per call
T_cache = cached input tokens billed at cache rate (often a fraction of T_in)
T_in,uncached = T_in − T_cache (when cache accounting splits uncached vs cached)

Cost per call (symbolic):

C_call = (T_in,uncached × P_in) + (T_cache × P_cache) + (T_out × P_out) + C_tool

Illustrative single-call math (plug in your vendor prices)
Variable	Example value	Line cost
T_in,uncached	3,000	3,000 × P_in
T_cache	2,000	2,000 × P_cache
T_out	500	500 × P_out
C_tool	embedding + rerank	fixed per call
Total	—	C_call

Measure T_in and T_out from staging logs (tokenizer or provider usage fields). Guessing “about 2k tokens” in an exec deck is how features get underpriced.
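As a rough sketch, the C_call formula translates directly into a few lines of Python. The token counts match the illustrative table; the prices (quoted per 1M tokens) and the tool cost are placeholders chosen for easy arithmetic, and the `Prices` and `cost_per_call` names are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Unit prices in dollars per 1M tokens (placeholders, not real quotes)."""
    p_in: float
    p_out: float
    p_cache: float


def cost_per_call(t_in_uncached: int, t_cache: int, t_out: int,
                  prices: Prices, c_tool: float = 0.0) -> float:
    """C_call = T_in,uncached*P_in + T_cache*P_cache + T_out*P_out + C_tool."""
    per_million = 1_000_000
    return (
        t_in_uncached * prices.p_in / per_million
        + t_cache * prices.p_cache / per_million
        + t_out * prices.p_out / per_million
        + c_tool
    )


# Token counts from the illustrative table; prices and c_tool are made up.
example_prices = Prices(p_in=2.00, p_out=8.00, p_cache=0.50)
c_call = cost_per_call(3_000, 2_000, 500, example_prices, c_tool=0.001)
print(f"C_call = ${c_call:.4f}")
```

Keeping prices in one small object makes it easy to swap in dated vendor values without touching the formula.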

5. Monthly Cost Formula

Let N = API calls per month (or per business unit), counting retries, since each retry is billed again. Let overhead C_fixed cover dashboards, support contracts, reserved capacity, and fine-tune hosting.

Monthly Cost = N × C_call + C_fixed

Expanded:

Monthly Cost = N × [(T_in,uncached × P_in) + (T_cache × P_cache) + (T_out × P_out) + C_tool] + C_fixed

For multi-step agents, sum C_call across steps or use C_run per user task × tasks per month.
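A minimal sketch of the monthly formula for a multi-step agent follows. It sums hypothetical per-step costs into C_run, then scales by tasks per month across three volume scenarios. The step costs, the fixed overhead, and the scenario names are assumptions for illustration only.

```python
def monthly_cost(n_calls: int, c_call: float, c_fixed: float = 0.0) -> float:
    """Monthly Cost = N * C_call + C_fixed."""
    return n_calls * c_call + c_fixed


def cost_per_run(step_costs: list[float]) -> float:
    """C_run for a multi-step agent: sum of C_call over every LLM step."""
    return sum(step_costs)


# Hypothetical dollars. A four-step agent task: plan, two tool turns, final answer.
c_run = cost_per_run([0.004, 0.010, 0.010, 0.016])

# Three scenarios, each with its own volume (tasks per month).
scenarios = {"pilot": 10_000, "target": 100_000, "stress": 1_000_000}
c_fixed = 2_000.0  # assumed flat monthly overhead in dollars

for name, tasks in scenarios.items():
    print(f"{name:>6}: ${monthly_cost(tasks, c_run, c_fixed):,.0f}")
```

Note how the fixed overhead dominates the pilot line and nearly vanishes in the stress line; that shift is worth calling out when presenting the scenarios.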

PM takeaway

Publish three scenarios—pilot, target, stress—each with its own N and token assumptions. Leadership remembers the stress number.

6. Scaling Scenarios: 10k, 100k, and 1M Calls

Use the same C_call; only N changes. This table is structural—replace symbols with priced values from your contract.

Monthly calls (N)	Inference subtotal	+ C_fixed	PM question to ask
10,000	10,000 × C_call	Monthly Cost	Does pilot budget match one team’s workflow?
100,000	100,000 × C_call	Monthly Cost	Do we need routing before wide rollout?
1,000,000	1,000,000 × C_call	Monthly Cost	Does gross margin hold at list price × conversion?

Sensitivity knobs (rank these for your feature):

±20% on T_out (verbosity policy)
±30% on T_in (RAG chunk count)
Cache hit rate on system prompt (0% vs 70% hit)
Share of calls routed to mid-tier vs frontier
Agent steps per task (1 vs 4 LLM calls)

Scenario	C_call vs baseline	At N = 1M
Baseline	1.0×	1.0M × C_call
Verbose outputs (+25% T_out)	~1.15–1.25×	Proportional increase
70% cache hit on 40% of input	Depends on P_cache	Recompute with split input
50% calls on mid-tier (½ P_in, ½ P_out)	Blended rate	Weighted average tiers
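The scenario multipliers can be reproduced with a short sketch. It assumes that only one knob changes at a time, that token counts are identical across tiers, and that C_tool is negligible. The function names are hypothetical.

```python
def c_call_multiplier(output_share: float, t_out_change: float) -> float:
    """How C_call scales when only T_out changes.

    output_share: fraction of baseline C_call that comes from output tokens.
    t_out_change: relative change in T_out, e.g. 0.25 for +25%.
    """
    return 1.0 + output_share * t_out_change


def blended_multiplier(share_mid: float, mid_price_ratio: float) -> float:
    """C_call multiplier when a share of calls moves to a cheaper tier.

    Assumes token counts are the same on both tiers and C_tool is negligible.
    """
    return (1.0 - share_mid) * 1.0 + share_mid * mid_price_ratio


# Verbose outputs (+25% T_out) when output is 60% or 100% of the bill.
low = c_call_multiplier(output_share=0.60, t_out_change=0.25)
high = c_call_multiplier(output_share=1.00, t_out_change=0.25)

# Half the calls on a mid tier priced at half the frontier rates.
blend = blended_multiplier(share_mid=0.50, mid_price_ratio=0.50)

print(f"+25% T_out: {low:.2f}x to {high:.2f}x baseline")
print(f"50% mid-tier routing: {blend:.2f}x baseline")
```

The verbosity range depends entirely on how much of C_call is output, which is why measuring the token mix comes before ranking the knobs.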

7. Model Tiers: Frontier, Mid, and Small

Tiers are product choices, not vanity labels. Map tiers to workflows—not to “we always use the best.”

Tier	Typical strengths	Typical PM use	Cost / latency
Frontier	Hard reasoning, long context, multimodal	Escalation, complex drafts, high-stakes review assist	Highest P_in/P_out; often slower TTFT
Mid	Strong general quality at lower price	Default copilot, Q&A with RAG, most chat	Balanced unit economics
Small	Fast classification, extraction, routing	Intent detect, JSON slot-fill, guardrail checks	Lowest per token; may need more repair loops

Relative pricing is usually order-of-magnitude, not 5%: small may be 10–30× cheaper per token than frontier on the same provider stack. Exact ratios belong in your memo with dated P values.

PM takeaway

Default to mid tier for volume paths; reserve frontier for steps that fail evals on mid—or for explicit user “deep analysis” opt-in.

8. Routing and Cascade Architectures

Routing sends each request to a model tier based on rules or a cheap classifier. Cascades try a cheap path first, then escalate only when confidence is low or eval checks fail.

Pattern	How it saves money	PM risk
Intent router → small / mid / large	Most traffic avoids frontier	Mis-routing on ambiguous queries
Cascade: small answer → judge → escalate	Pay frontier only on hard subset	Latency stacks on escalations
Tool-first: DB lookup before LLM	Zero tokens when lookup succeeds	Product feels “dumb” if routing too aggressive
Human queue for low confidence	Avoid repeated auto-retries	Ops cost moves, does not disappear

Model routing in the PRD should specify: triggers for escalation, max steps per user task, and eval gates that block silent downgrade of quality.
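To illustrate, here is a hedged sketch of a rule-based router and the expected cost of a cascade. The intent labels, the 0.6 confidence threshold, the per-call costs, and the 20% escalation rate are all hypothetical; a real system would derive them from evals and logs.

```python
def route(intent: str, confidence: float) -> str:
    """Pick a tier from a cheap classifier's output (hypothetical rules)."""
    if confidence < 0.6:
        return "frontier"  # ambiguous query: escalate rather than guess
    if intent in {"classify", "extract"}:
        return "small"
    return "mid"


def cascade_expected_cost(c_small: float, c_judge: float,
                          c_frontier: float, escalation_rate: float) -> float:
    """Expected cost per request: always pay small + judge, sometimes frontier."""
    return c_small + c_judge + escalation_rate * c_frontier


# Hypothetical per-call dollars; 20% of requests fail the judge and escalate.
cascade = cascade_expected_cost(0.001, 0.001, 0.030, escalation_rate=0.20)
frontier_only = 0.030

print(route("extract", 0.92))
print(route("draft", 0.41))
print(f"cascade ${cascade:.4f} vs frontier-only ${frontier_only:.4f}")
```

The cascade only pays off while the escalation rate stays low; if most traffic escalates, the small-model and judge calls become pure overhead on top of the frontier cost.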

9. Caching Economics

Prompt caching (provider feature) discounts repeated long prefixes such as system prompts, tool definitions, and stable policy docs. Some providers charge a cache write cost when the prefix is first stored, and again after it expires; hits are then billed at the reduced P_cache.

Concept	PM implication
Cacheable prefix	Put stable content first; volatile user text last
Hit rate	High-volume features with shared system prompt benefit most
TTL / invalidation	Policy change must bust cache or you serve stale instructions
Application cache	Store embeddings and retrieval results—not only LLM prefix cache

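Break-even sketch: if cache engineering costs E engineer-weeks, convert E to dollars and compare it against monthly savings of N × T_prefix × hit_rate × (P_in − P_cache), net of any cache write cost. Build cost divided by monthly savings gives the payback period in months.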
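The break-even comparison can be sketched as follows. Prices are in dollars per 1M tokens, the engineer cost per week is an assumed figure, and the sketch ignores any cache write surcharge, so treat the result as an upper bound on savings.

```python
def monthly_cache_savings(n_calls: int, t_prefix: int, hit_rate: float,
                          p_in: float, p_cache: float) -> float:
    """N * T_prefix * hit_rate * (P_in - P_cache), prices in dollars per 1M tokens.

    Ignores any cache-write surcharge a provider may bill.
    """
    return n_calls * t_prefix * hit_rate * (p_in - p_cache) / 1_000_000


def payback_months(engineer_weeks: float, cost_per_week: float,
                   savings_per_month: float) -> float:
    """Months until cumulative savings cover the one-time build cost."""
    if savings_per_month <= 0:
        return float("inf")
    return engineer_weeks * cost_per_week / savings_per_month


# All numbers hypothetical: 1M calls, 2,000-token prefix, 70% hit rate.
savings = monthly_cache_savings(1_000_000, 2_000, 0.70, p_in=2.00, p_cache=0.50)
months = payback_months(engineer_weeks=2, cost_per_week=4_000,
                        savings_per_month=savings)
print(f"savings ${savings:,.0f}/month, payback in {months:.1f} months")
```

Rerunning the same function at pilot volume usually shows a much longer payback, which is a reasonable argument for deferring cache work until traffic justifies it.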

PM takeaway

Caching is a product architecture decision—stable system prompt design—not only an infra ticket.

10. Latency vs Cost

Users feel time to first token (TTFT) and total time to last token. Finance feels monthly tokens. The two often pull in opposite directions.

UX need	Typical lever	Cost note
Chat feels instant	Streaming, smaller model, shorter context	Smaller model may need second call
Batch overnight jobs	Batch API, lower priority queue	Lower P per token; not for interactive UX
Complex analysis	Frontier + long output	Accept higher C_call and p95 latency
Global users	Regional endpoints	May affect data residency and price list

Spec latency budgets per workflow tier: e.g. “routing < 300ms,” “user-visible draft starts < 2s TTFT,” “full report < 45s.” Pair each budget with an allowed tier and max tokens.
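One way to keep such budgets checkable is to write them down as data. The sketch below mirrors the three example budgets; the tier assignments, token caps, field names, and measured p95 values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    workflow: str
    max_seconds: float      # budget for the metric named in `metric`
    metric: str             # "total" or "ttft"
    allowed_tier: str
    max_output_tokens: int


# Budgets mirror the examples in the text; tiers and token caps are assumptions.
BUDGETS = [
    LatencyBudget("routing", 0.3, "total", "small", 16),
    LatencyBudget("draft", 2.0, "ttft", "mid", 800),
    LatencyBudget("full_report", 45.0, "total", "frontier", 4_000),
]


def violations(measured_p95: dict[str, float]) -> list[str]:
    """Return workflows whose measured p95 exceeds the budget."""
    return [b.workflow for b in BUDGETS
            if measured_p95.get(b.workflow, 0.0) > b.max_seconds]


# Hypothetical p95 measurements from staging, in seconds.
print(violations({"routing": 0.25, "draft": 2.6, "full_report": 38.0}))
```

A table like this can live next to the PRD so that a tier change and its latency consequence are reviewed together.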

11. Cost Risks and Failure Modes

Retry storms — Failed tool calls that re-invoke the LLM with full context each time.
Context creep — Chat history grows unbounded; every turn re-sends the thread.
Agent loops — “Try again” until max steps without a stopping rule.
Free-tier abuse — Power users on unlimited plans burning frontier models.
Eval gap — Cheap model ships; human rework costs more than inference saved.
Price list change — No alert when vendor updates P_in/P_out.
Shadow AI — Teams paste production data into consumer chat tools (different risk, still cost and compliance).

Control	What it limits
Per-user / per-tenant token budgets	Runaway sessions
Max steps per agent task	Loop spend
Cost dashboards by feature flag	Surprises at 1M scale
Canary on price change	Margin erosion
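Two of these controls, per-user token budgets and a max-steps cap, fit in a short sketch. The class and function names are hypothetical, and the list of token counts stands in for real LLM calls.

```python
class TokenBudget:
    """Per-user monthly token allowance (limits are hypothetical)."""

    def __init__(self, monthly_limit: int) -> None:
        self.monthly_limit = monthly_limit
        self.used = 0

    def try_spend(self, tokens: int) -> bool:
        """Record usage if it fits; refuse the call otherwise."""
        if self.used + tokens > self.monthly_limit:
            return False
        self.used += tokens
        return True


def run_agent(step_tokens: list[int], budget: TokenBudget, max_steps: int = 4) -> int:
    """Run up to max_steps LLM steps; stop early if the budget is exhausted.

    step_tokens stands in for real LLM calls: one token count per attempted step.
    Returns the number of steps actually executed.
    """
    executed = 0
    for tokens in step_tokens[:max_steps]:
        if not budget.try_spend(tokens):
            break
        executed += 1
    return executed


budget = TokenBudget(monthly_limit=10_000)
steps_run = run_agent([3_000, 3_000, 3_000, 3_000, 3_000, 3_000], budget)
print(steps_run, budget.used)
```

Here the agent wants six steps, the step cap allows four, and the budget stops it after three. Either guard alone would have let more spend through.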

12. Hands-On Exercise

Pick one real or hypothetical AI feature (e.g. internal policy Q&A, code review assist, claims note draft). Complete:

Log or estimate T_in, T_out, and tool costs for one happy-path call.
Look up current P_in, P_out, P_cache on your vendor site—date-stamp them in a footnote.
Compute C_call and monthly cost at N = 10k, 100k, 1M.
Propose one routing rule and one output cap that cut cost ≥20% with acceptable eval risk.
Write one sentence for finance: “At 100k calls/month we spend approximately ___ assuming ___.”

Done when

You can explain the biggest cost driver in plain language (usually output length, RAG input, or agent steps—not “the model brand”).

13. Decision Framework Matrix

Use this when choosing default tier and architecture for a new workflow:

Dimension	Lean cheap	Balanced	Lean quality
Task difficulty	Classification, extraction	RAG Q&A, drafting with review	Multi-step reasoning, rare edge cases
Default tier	Small	Mid	Frontier (or cascade into it)
Context strategy	Minimal prompt	RAG top-k tuned	Long context + tools
Output policy	Strict max tokens + schema	Templates + citations	Longer drafts, human edit
Latency target	< 1s TTFT	2–5s acceptable	Quality over speed
When margin fails	Batch, cache, cut features	Route + compress prompts	Charge premium tier or limit volume

14. Common PM Mistakes

Mistake	Why it hurts
Quoting list price without token counts	“$3 per million” is meaningless alone
Assuming input is the whole bill	Output and tools often dominate
Frontier as default for all traffic	Margin collapses at scale
Ignoring failed calls	Each retry is billed again; one retry per call doubles spend
No observability by feature	Cannot attribute overrun
Benchmark cost on demo prompts only	Production prompts are often 3–10× larger
Hardcoding prices in code/docs	Stale numbers mislead leadership

Chapter Summary

Concept	PM understanding
Formula	Monthly = N × (input + output + cache + tools) + fixed
Symbols	P_in, P_out, P_cache from vendor; refresh often
Output tokens	Often the silent margin killer
Scale	Model 10k / 100k / 1M with same C_call
Tiers	Small / mid / frontier mapped to workflows
Routing & cache	Architecture levers, not afterthoughts
Latency	Trade speed, quality, and C_call explicitly
Next step	Chapter 6 — model selection memo

Closing Thought

The PM who wins budget and trust shows the spreadsheet: tokens per call, price symbols dated this quarter, three volume scenarios, and a plan when the stress case happens.

Model choice without economics is branding. Economics without evals is false savings. You need both—which is why the next chapter is the capstone memo that ties benchmarks, cost, and risk into one decision document.

The real PM lesson

Own the formula. Refresh the prices. Defend the tier mix with evals and unit economics together.
