Introduction
In Module 01, Chapter 3, we covered tokens and context windows—the working memory of an LLM call.
In Chapter 4, we covered benchmarks and product-specific evals—how to compare models on your tasks, not leaderboard hype.
This chapter connects quality decisions to money and speed. You will not find provider price tables here. Prices change; contracts differ; cached and batch rates have their own rules. What stays useful is the structure of the bill.
Monthly Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price) + (Cached Tokens × Cached Price) + Tool/Infra
Plug in current list prices from your vendor’s pricing page when you build a memo. The PM job is to own the variables—volume, token mix, tier, routing, cache hit rate—not to memorize yesterday’s cents per million.
PM takeaway
Cost is a product constraint, not a finance surprise after launch. Model it in the PRD the same way you model API rate limits.
1. Why Cost Matters for PMs
Traditional SaaS features have near-zero marginal cost per extra click. LLM features do not. Each successful completion burns input tokens, output tokens, and often retrieval, tools, and human review on top.
Cost shows up in four places PMs care about:
- Unit economics — Can this feature survive at expected usage and price?
- Roadmap tradeoffs — A “smarter” default model may erase margin on a high-volume workflow.
- Reliability budgets — Agent loops and long outputs multiply spend per user action.
- Procurement conversations — You need a defendable model, not “we picked the famous one.”
| Stakeholder question | What you answer with |
|---|---|
| “What if usage grows 10×?” | Scaled monthly formula + tier/routing plan |
| “Why is support AI expensive?” | Output length + RAG input + tool round-trips |
| “Can we use the frontier model everywhere?” | Cost per successful task by tier |
| “Is caching worth engineering?” | Break-even on cache hit rate vs build time |
PM takeaway
If you cannot sketch monthly cost at 10k and 1M calls, you are not ready to commit to a default model in production.
2. Token Types and What Gets Billed
Providers bill by token category. Names differ by vendor; the PM concepts are stable.
| Category | Typical contents | PM note |
|---|---|---|
| Input | System prompt, user message, RAG chunks, tool results fed back in | Often the largest line item for grounded apps |
| Output | Model completion text (and sometimes “reasoning” tokens) | Usually priced higher per token than input |
| Cached input | Prefix repeated across calls (system prompt, doc index) | Separate rate Pcache; hits reduce average cost |
| Tool / infra | Embeddings, rerankers, speech, vision units, vector DB, egress | Not always in the LLM invoice—still COGS |
Symbolic unit prices (per million tokens or per your vendor’s unit):
| Symbol | Meaning |
|---|---|
| Pin | Price per input token (or per 1M input tokens) |
| Pout | Price per output token |
| Pcache | Price per cached input token (read from cache) |
| Ctool | Fixed + variable tool/infra cost per call |
Multimodal and long-context models may bill images, audio seconds, or context tiers separately. When you spec a feature, list every metered dimension, not only “GPT call.”
3. Why Output Tokens Matter So Much
Teams optimize prompts for input size—and forget that verbose outputs dominate many bills. Pout is often several times Pin on the same model family.
High-output patterns PMs should flag in reviews:
- “Explain your reasoning in detail” on every ticket
- JSON wrapped in long natural-language preambles
- Agent scratchpads kept in the user-visible thread
- Unbounded “summarize this 40-page PDF” without length caps
- Multiple draft variants per request (A/B at full generation cost)
| Design lever | Effect on bill |
|---|---|
| Max output tokens | Hard cap on worst case per call |
| Structured output (schema) | Shorter, parseable completions |
| Two-step: classify then generate | Small model for routing; large only when needed |
| Streaming + early stop | Stop when confidence threshold met (with care) |
Watch the denominator
Finance will ask cost per successful outcome. A cheap model with 40% rework can cost more than a pricier model with 90% first-pass success.
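The denominator effect can be sketched in a few lines. This is a minimal illustration, assuming each first-pass failure is fixed exactly once at a flat rework cost; the function name `cost_per_success` and every dollar figure are hypothetical placeholders, not vendor prices.

```python
def cost_per_success(call_cost: float, first_pass_rate: float, rework_cost: float) -> float:
    """Expected cost of one successful outcome.

    Simplifying assumption: every task gets one model attempt, and each
    first-pass failure is repaired once at `rework_cost` (human edit,
    retry on a larger model, and so on).
    """
    if not 0.0 <= first_pass_rate <= 1.0:
        raise ValueError("first_pass_rate must be between 0 and 1")
    return call_cost + (1.0 - first_pass_rate) * rework_cost


# Hypothetical dollar values for illustration only.
cheap = cost_per_success(call_cost=0.002, first_pass_rate=0.60, rework_cost=0.50)
pricey = cost_per_success(call_cost=0.020, first_pass_rate=0.90, rework_cost=0.50)

print(f"cheap model:   ${cheap:.3f} per successful outcome")
print(f"pricier model: ${pricey:.3f} per successful outcome")
```

With these placeholder inputs the model that is ten times cheaper per call ends up costing more per successful outcome, because rework dominates the total.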
4. Unit Economics per Call
Define one representative inference call for your workflow:
- Tin = input tokens per call
- Tout = output tokens per call
- Tcache = cached input tokens billed at cache rate (often a fraction of Tin)
- Tin,uncached = Tin − Tcache (when cache accounting splits uncached vs cached)
Cost per call (symbolic):
Ccall = (Tin,uncached × Pin) + (Tcache × Pcache) + (Tout × Pout) + Ctool
| Variable | Example value | Line cost |
|---|---|---|
| Tin,uncached | 3,000 | 3,000 × Pin |
| Tcache | 2,000 | 2,000 × Pcache |
| Tout | 500 | 500 × Pout |
| Ctool | embedding + rerank | fixed per call |
| Total | — | Ccall |
Measure Tin and Tout from staging logs (tokenizer or provider usage fields). Guessing “about 2k tokens” in an exec deck is how features get underpriced.
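As a rough sketch, the Ccall formula above maps directly onto a small function. The token counts come from the example table; the `Prices` values and the tool cost are hypothetical placeholders (dollars per 1M tokens) that you would replace with dated figures from your own contract.

```python
from dataclasses import dataclass

PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Prices:
    """Unit prices in dollars per 1M tokens (hypothetical placeholders)."""
    p_in: float
    p_out: float
    p_cache: float


def cost_per_call(t_in_uncached: int, t_cache: int, t_out: int,
                  prices: Prices, c_tool: float = 0.0) -> float:
    """Ccall = Tin,uncached × Pin + Tcache × Pcache + Tout × Pout + Ctool."""
    token_cost = (
        t_in_uncached * prices.p_in
        + t_cache * prices.p_cache
        + t_out * prices.p_out
    ) / PER_MILLION
    return token_cost + c_tool


prices = Prices(p_in=2.00, p_out=8.00, p_cache=0.50)  # placeholders, not real rates
c_call = cost_per_call(3_000, 2_000, 500, prices, c_tool=0.001)
print(f"Ccall = ${c_call:.4f}")
```

Keeping prices in one dated object makes it easy to refresh them without touching the formula.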
5. Monthly Cost Formula
Let N = successful API calls per month (or per business unit). Let overhead Cfixed cover dashboards, support contracts, reserved capacity, and fine-tune hosting.
Monthly Cost = N × Ccall + Cfixed
Expanded:
Monthly Cost = N × [(Tin,uncached × Pin) + (Tcache × Pcache) + (Tout × Pout) + Ctool] + Cfixed
For multi-step agents, sum Ccall across steps or use Crun per user task × tasks per month.
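For a multi-step agent, one way to sketch the monthly formula is to sum the per-step Ccall values into a per-task Crun first. The step costs, task volume, and fixed overhead below are hypothetical numbers chosen only to make the arithmetic visible.

```python
def monthly_cost(n_tasks: int, step_costs: list[float], c_fixed: float = 0.0) -> float:
    """Monthly Cost = N × Crun + Cfixed, where Crun sums Ccall over agent steps."""
    c_run = sum(step_costs)
    return n_tasks * c_run + c_fixed


# Hypothetical Ccall per step, in dollars: route, retrieve-and-draft, verify.
steps = [0.0005, 0.0120, 0.0030]
c_run = sum(steps)
total = monthly_cost(n_tasks=100_000, step_costs=steps, c_fixed=2_000.0)

print(f"Crun per task: ${c_run:.4f}")
print(f"Monthly cost at 100k tasks: ${total:,.2f}")
```

Listing step costs separately also shows which step to attack first when the total is too high.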
PM takeaway
Publish three scenarios—pilot, target, stress—each with its own N and token assumptions. Leadership remembers the stress number.
6. Scaling Scenarios: 10k, 100k, and 1M Calls
Use the same Ccall; only N changes. This table is structural—replace symbols with priced values from your contract.
| Monthly calls (N) | Inference subtotal | Monthly Cost (subtotal + Cfixed) | PM question to ask |
|---|---|---|---|
| 10,000 | 10,000 × Ccall | 10,000 × Ccall + Cfixed | Does pilot budget match one team’s workflow? |
| 100,000 | 100,000 × Ccall | 100,000 × Ccall + Cfixed | Do we need routing before wide rollout? |
| 1,000,000 | 1,000,000 × Ccall | 1,000,000 × Ccall + Cfixed | Does gross margin hold at list price × conversion? |
Sensitivity knobs (rank these for your feature):
- ±20% on Tout (verbosity policy)
- ±30% on Tin (RAG chunk count)
- Cache hit rate on system prompt (0% vs 70% hit)
- Share of calls routed to mid-tier vs frontier
- Agent steps per task (1 vs 4 LLM calls)
| Scenario | Ccall vs baseline | At N = 1M |
|---|---|---|
| Baseline | 1.0× | 1.0M × Ccall |
| Verbose outputs (+25% Tout) | Up to 1.25×; ~1.15–1.25× when output dominates Ccall | Proportional increase |
| 70% cache hit on 40% of input | Depends on Pcache | Recompute with split input |
| 50% calls on mid-tier (½ Pin, ½ Pout) | Blended rate | Weighted average tiers |
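The scenario rows can be recomputed with a short script once real prices are plugged in. The sketch below assumes hypothetical token counts and prices (dollars per 1M tokens), ignores caching and tool cost for brevity, and treats the mid tier as exactly half the frontier price on both input and output, as in the last table row.

```python
PER_MILLION = 1_000_000


def cost_per_call(t_in: int, t_out: int, p_in: float, p_out: float) -> float:
    """Token cost of one call, without cache or tool terms."""
    return (t_in * p_in + t_out * p_out) / PER_MILLION


# Hypothetical baseline: token counts and frontier prices per 1M tokens.
T_IN, T_OUT = 2_000, 1_000
P_IN, P_OUT = 2.00, 8.00

baseline = cost_per_call(T_IN, T_OUT, P_IN, P_OUT)
verbose = cost_per_call(T_IN, int(T_OUT * 1.25), P_IN, P_OUT)   # +25% Tout
mid_tier = cost_per_call(T_IN, T_OUT, P_IN / 2, P_OUT / 2)      # half-price tier
blended = 0.5 * baseline + 0.5 * mid_tier                        # 50% of calls on mid

scenarios = {
    "baseline": baseline,
    "verbose (+25% Tout)": verbose,
    "50% on mid-tier": blended,
}
for name, c in scenarios.items():
    ratio = c / baseline
    row = ", ".join(f"N={n:,}: ${n * c:,.0f}" for n in (10_000, 100_000, 1_000_000))
    print(f"{name:<22} {ratio:.2f}x  {row}")
```

How far the verbose case moves depends on how much of Ccall is output; with a larger RAG input the same +25% on Tout shifts the total less.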
7. Model Tiers: Frontier, Mid, and Small
Tiers are product choices, not vanity labels. Map tiers to workflows—not to “we always use the best.”
| Tier | Typical strengths | Typical PM use | Cost / latency |
|---|---|---|---|
| Frontier | Hard reasoning, long context, multimodal | Escalation, complex drafts, high-stakes review assist | Highest Pin/Pout; slower TTFT often |
| Mid | Strong general quality at lower price | Default copilot, Q&A with RAG, most chat | Balanced unit economics |
| Small | Fast classification, extraction, routing | Intent detect, JSON slot-fill, guardrail checks | Lowest per token; may need more repair loops |
Relative pricing is usually order-of-magnitude, not 5%: small may be 10–30× cheaper per token than frontier on the same provider stack. Exact ratios belong in your memo with dated P values.
PM takeaway
Default to mid tier for volume paths; reserve frontier for steps that fail evals on mid—or for explicit user “deep analysis” opt-in.
8. Routing and Cascade Architectures
Routing sends each request to a model tier based on rules or a cheap classifier. Cascades try a cheap path first, then escalate only when confidence is low or eval checks fail.
| Pattern | How it saves money | PM risk |
|---|---|---|
| Intent router → small / mid / large | Most traffic avoids frontier | Mis-routing on ambiguous queries |
| Cascade: small answer → judge → escalate | Pay frontier only on hard subset | Latency stacks on escalations |
| Tool-first: DB lookup before LLM | Zero tokens when lookup succeeds | Product feels “dumb” if routing too aggressive |
| Human queue for low confidence | Avoid repeated auto-retries | Ops cost moves, does not disappear |
Model routing in the PRD should specify: triggers for escalation, max steps per user task, and eval gates that block silent downgrade of quality.
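A cascade's expected cost per request can be estimated before anything is built. The sketch below assumes every request pays for the small model and the judge, and only escalated requests also pay for the frontier call; the function names and dollar values are hypothetical.

```python
def cascade_cost(c_small: float, c_judge: float, c_frontier: float,
                 escalation_rate: float) -> float:
    """Expected cost per request for small answer -> judge -> escalate."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return c_small + c_judge + escalation_rate * c_frontier


def break_even_escalation_rate(c_small: float, c_judge: float, c_frontier: float) -> float:
    """Escalation rate above which the cascade costs more than frontier-only."""
    return 1.0 - (c_small + c_judge) / c_frontier


# Hypothetical per-call costs in dollars.
C_SMALL, C_JUDGE, C_FRONTIER = 0.001, 0.001, 0.020

expected = cascade_cost(C_SMALL, C_JUDGE, C_FRONTIER, escalation_rate=0.20)
limit = break_even_escalation_rate(C_SMALL, C_JUDGE, C_FRONTIER)

print(f"cascade: ${expected:.4f} per request vs frontier-only: ${C_FRONTIER:.4f}")
print(f"cascade stops saving money above {limit:.0%} escalation")
```

The escalation rate is the number to track in evals: it drives both the blended cost and how often users feel the stacked latency.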
9. Caching Economics
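Prompt caching (a provider feature) discounts repeated long prefixes such as system prompts, tool definitions, and stable policy docs. Depending on the provider, you may pay a cache write cost when the prefix is first stored, then the reduced rate Pcache on each hit until the cache entry expires.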
| Concept | PM implication |
|---|---|
| Cacheable prefix | Put stable content first; volatile user text last |
| Hit rate | High-volume features with shared system prompt benefit most |
| TTL / invalidation | Policy change must bust cache or you serve stale instructions |
| Application cache | Store embeddings and retrieval results—not only LLM prefix cache |
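Break-even sketch: let Tprefix be the cacheable prefix tokens per call. If cache engineering costs E engineer-weeks, convert E to dollars at your loaded engineering rate and compare it with monthly savings of N × Tprefix × hit_rate × (Pin − Pcache). Build cost divided by monthly savings gives the months to break even.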
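The break-even comparison can be worked through numerically. This sketch assumes prices in dollars per 1M tokens, ignores any cache write premium, and converts engineer-weeks to dollars with a hypothetical weekly loaded cost; all inputs are placeholders.

```python
PER_MILLION = 1_000_000


def monthly_cache_savings(n_calls: int, t_prefix: int, hit_rate: float,
                          p_in: float, p_cache: float) -> float:
    """N × Tprefix × hit_rate × (Pin − Pcache), with prices per 1M tokens."""
    return n_calls * t_prefix * hit_rate * (p_in - p_cache) / PER_MILLION


def months_to_break_even(engineer_weeks: float, weekly_cost: float,
                         savings_per_month: float) -> float:
    """Build cost in dollars divided by monthly savings."""
    if savings_per_month <= 0:
        return float("inf")  # caching never pays back under these inputs
    return engineer_weeks * weekly_cost / savings_per_month


# Hypothetical inputs: 1M calls/month, 2,000-token prefix, 70% hit rate.
savings = monthly_cache_savings(1_000_000, 2_000, 0.70, p_in=2.00, p_cache=0.50)
months = months_to_break_even(engineer_weeks=2, weekly_cost=4_000, savings_per_month=savings)

print(f"monthly savings: ${savings:,.0f}")
print(f"break-even after {months:.1f} months")
```

At 10k calls per month the same inputs save about one hundredth as much, which is why volume belongs in the caching decision.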
PM takeaway
Caching is a product architecture decision—stable system prompt design—not only an infra ticket.
10. Latency vs Cost
Users feel time to first token (TTFT) and total time to last token. Finance feels monthly tokens. They pull in opposite directions.
| UX need | Typical lever | Cost note |
|---|---|---|
| Chat feels instant | Streaming, smaller model, shorter context | Smaller model may need second call |
| Batch overnight jobs | Batch API, lower priority queue | Lower P per token; not for interactive UX |
| Complex analysis | Frontier + long output | Accept higher Ccall and p95 latency |
| Global users | Regional endpoints | May affect data residency and price list |
Spec latency budgets per workflow tier: e.g. “routing < 300ms,” “user-visible draft starts < 2s TTFT,” “full report < 45s.” Pair each budget with an allowed tier and max tokens.
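One way to make such budgets checkable is to write them down as data. In the sketch below the three latency limits come from the example above, while the workflow names, tier assignments, and token caps are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorkflowBudget:
    name: str
    allowed_tier: str
    max_output_tokens: int
    ttft_ms: Optional[int] = None    # limit on time to first token
    total_ms: Optional[int] = None   # limit on total time


BUDGETS = {
    "routing": WorkflowBudget("routing", "small", 16, total_ms=300),
    "draft": WorkflowBudget("draft", "mid", 800, ttft_ms=2_000),
    "full_report": WorkflowBudget("full_report", "frontier", 4_000, total_ms=45_000),
}


def violations(budget: WorkflowBudget, ttft_ms: int, total_ms: int,
               output_tokens: int) -> List[str]:
    """Return the names of the budget fields that one observed call broke."""
    broken = []
    if budget.ttft_ms is not None and ttft_ms >= budget.ttft_ms:
        broken.append("ttft_ms")
    if budget.total_ms is not None and total_ms >= budget.total_ms:
        broken.append("total_ms")
    if output_tokens > budget.max_output_tokens:
        broken.append("max_output_tokens")
    return broken


# A draft call that started late but stayed within its token cap.
print(violations(BUDGETS["draft"], ttft_ms=2_500, total_ms=9_000, output_tokens=600))
```

The same table can feed a dashboard alert or a release gate, so the PRD numbers and the monitored numbers stay identical.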
11. Cost Risks and Failure Modes
- Retry storms — Failed tool calls that re-invoke the LLM with full context each time.
- Context creep — Chat history grows unbounded; every turn re-sends the thread.
- Agent loops — “Try again” until max steps without a stopping rule.
- Free-tier abuse — Power users on unlimited plans burning frontier models.
- Eval gap — Cheap model ships; human rework costs more than inference saved.
- Price list change — No alert when vendor updates Pin/Pout.
- Shadow AI — Teams paste production data into consumer chat tools (different risk, still cost and compliance).
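Context creep is easy to underestimate because the growth is quadratic in the number of turns: each turn re-sends everything before it. The sketch below assumes fixed, hypothetical token counts per system prompt, user message, and reply, and compares an unbounded thread with one that keeps only the last few exchanges.

```python
from typing import Optional


def thread_input_tokens(turns: int, system_tokens: int, user_tokens: int,
                        reply_tokens: int, keep_last: Optional[int] = None) -> int:
    """Total input tokens billed across a chat when each turn re-sends history.

    `keep_last` caps how many prior exchanges are re-sent; None means all.
    """
    total = 0
    for turn in range(1, turns + 1):
        prior_exchanges = turn - 1
        if keep_last is not None:
            prior_exchanges = min(prior_exchanges, keep_last)
        history = prior_exchanges * (user_tokens + reply_tokens)
        total += system_tokens + history + user_tokens
    return total


# Hypothetical 20-turn chat: 500-token system prompt, 100-token messages, 300-token replies.
unbounded = thread_input_tokens(20, 500, 100, 300)
windowed = thread_input_tokens(20, 500, 100, 300, keep_last=4)

print(f"unbounded history: {unbounded:,} input tokens")
print(f"last 4 exchanges:  {windowed:,} input tokens")
```

A window trades cost for memory of earlier turns, so the cutoff should be validated against evals rather than chosen on cost alone.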
| Control | What it limits |
|---|---|
| Per-user / per-tenant token budgets | Runaway sessions |
| Max steps per agent task | Loop spend |
| Cost dashboards by feature flag | Surprises at 1M scale |
| Canary on price change | Margin erosion |
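The first two controls can be sketched as a small guard that an agent loop consults before every LLM step. The class and exception names are hypothetical, and the token figures are estimates supplied by the caller rather than provider-reported usage.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a task would exceed its token budget or step limit."""


@dataclass
class TaskGuard:
    token_budget: int
    max_steps: int
    tokens_used: int = 0
    steps_taken: int = 0

    def charge(self, tokens: int) -> None:
        """Record one planned LLM step; raise before it runs if a limit would break."""
        if self.steps_taken + 1 > self.max_steps:
            raise BudgetExceeded("max steps reached")
        if self.tokens_used + tokens > self.token_budget:
            raise BudgetExceeded("token budget exhausted")
        self.steps_taken += 1
        self.tokens_used += tokens


guard = TaskGuard(token_budget=10_000, max_steps=4)
stopped_reason = None
for estimated_tokens in [3_000, 3_000, 3_000, 3_000, 3_000]:
    try:
        guard.charge(estimated_tokens)
    except BudgetExceeded as exc:
        stopped_reason = str(exc)
        break

print(guard.steps_taken, guard.tokens_used, stopped_reason)
```

Checking before the call rather than after means the guard stops spend instead of only reporting it.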
12. Hands-On Exercise
Pick one real or hypothetical AI feature (e.g. internal policy Q&A, code review assist, claims note draft). Complete:
- Log or estimate Tin, Tout, and tool costs for one happy-path call.
- Look up current Pin, Pout, Pcache on your vendor site—date-stamp them in a footnote.
- Compute Ccall and monthly cost at N = 10k, 100k, 1M.
- Propose one routing rule and one output cap that cut cost ≥20% with acceptable eval risk.
- Write one sentence for finance: “At 100k calls/month we spend approximately ___ assuming ___.”
Done when
You can explain the biggest cost driver in plain language (usually output length, RAG input, or agent steps—not “the model brand”).
13. Decision Framework Matrix
Use this when choosing default tier and architecture for a new workflow:
| Dimension | Lean cheap | Balanced | Lean quality |
|---|---|---|---|
| Task difficulty | Classification, extraction | RAG Q&A, drafting with review | Multi-step reasoning, rare edge cases |
| Default tier | Small | Mid | Frontier (or cascade into it) |
| Context strategy | Minimal prompt | RAG top-k tuned | Long context + tools |
| Output policy | Strict max tokens + schema | Templates + citations | Longer drafts, human edit |
| Latency target | < 1s TTFT | 2–5s acceptable | Quality over speed |
| When margin fails | Batch, cache, cut features | Route + compress prompts | Charge premium tier or limit volume |
14. Common PM Mistakes
| Mistake | Why it hurts |
|---|---|
| Quoting list price without token counts | “$3 per million” is meaningless alone |
| Assuming input is the whole bill | Output and tools often dominate |
| Frontier as default for all traffic | Margin collapses at scale |
| Ignoring failed calls | Retries re-send the full context and can double spend |
| No observability by feature | Cannot attribute overrun |
| Benchmark cost on demo prompts only | Production prompts are 3–10× larger |
| Hardcoding prices in code/docs | Stale numbers mislead leadership |
Chapter Summary
| Concept | PM understanding |
|---|---|
| Formula | Monthly = N × (input + output + cache + tools) + fixed |
| Symbols | Pin, Pout, Pcache from vendor; refresh often |
| Output tokens | Often the silent margin killer |
| Scale | Model 10k / 100k / 1M with same Ccall |
| Tiers | Small / mid / frontier mapped to workflows |
| Routing & cache | Architecture levers, not afterthoughts |
| Latency | Trade speed, quality, and Ccall explicitly |
| Next step | Chapter 6 — model selection memo |
Closing Thought
The PM who wins budget and trust shows the spreadsheet: tokens per call, price symbols dated this quarter, three volume scenarios, and a plan when the stress case happens.
Model choice without economics is branding. Economics without evals is false savings. You need both—which is why the next chapter is the capstone memo that ties benchmarks, cost, and risk into one decision document.
The real PM lesson
Own the formula. Refresh the prices. Defend the tier mix with evals and unit economics together.