The Model Selection Memo — The PM Version

Introduction

Module 02 walked from model families ( Chapter 1 ) through vendor fit ( Chapter 2 ), multimodal scope (Chapter 3), benchmarks (Chapter 4), and unit economics ( Chapter 5 ).

This chapter is the deliverable: a model selection memo you can attach to a PRD, send to engineering, and defend in a review. It is not a slide deck of logos. It is a decision record—what you chose, what you rejected, what it costs at scale, how you will know if quality holds, and what can go wrong.

If the memo does not name a default model, a fallback tier, and a revisit trigger, it is not finished.

PM takeaway

The memo is the contract between product intent and ML ops reality. Write it before you promise a launch date.

1. Why a Memo Exists

Without a memo, teams default to:

Whoever shouted loudest in the vendor demo
The model the principal engineer already uses
“Frontier everywhere” until finance intervenes six months later

A good memo forces alignment on:

Audience	What they need from the memo
Engineering	Default endpoints, tiers, routing, eval gates
Design / CX	Latency budget, confidence UX, human review points
Finance	N scenarios and symbolic cost formula with dated prices
Legal / compliance	Data residency, retention, prohibited uses
Leadership	One-page recommendation and explicit risks

Length: typically 3–8 pages plus appendices (eval set description, sample prompts). Shorter is fine for a pilot; production needs the full stack.

2. Recommended Memo Structure

Executive summary — Decision in two sentences; default model + tier mix.
Problem & user job — Who uses it, success metric, what “wrong” looks like.
Constraints — Latency, languages, modalities, residency, budget cap.
Candidate shortlist — 2–4 options (families or specific endpoints), not a market map.
Selection matrix — Scored on your dimensions (template below).
Quality plan — Eval set, pass thresholds, human review sample rate.
Cost model — C_call, monthly at 10k / 100k / 1M, sensitivity notes.
Architecture sketch — RAG, tools, routing, cache, HITL—not optional for grounded apps.
Risk register — Hallucination, privacy, vendor lock-in, price change.
Recommendation — Primary, fallback, escalation path.
Revisit triggers — New model release, eval drift, 2× volume, incident.
Open questions — What you will learn in pilot week 1–4.

PM takeaway

Put the recommendation on page one. Appendices hold the evidence; busy reviewers never read page six first.

3. Selection Matrix Template

Score 1–5 or Red / Yellow / Green per candidate. Weight rows by product priority.

Dimension	Weight	Candidate A	Candidate B	Candidate C
Task quality on golden set	High
Structured output reliability	High
Latency (p95 TTFT / total)	Med
Context + RAG fit	Med
Tool / function calling	Med
Safety & refusals (appropriate)	Med
Unit economics at target N	High
Ops: logging, eval hooks, SLAs	Med
Compliance (residency, BAA, etc.)	As needed
Vendor / lock-in risk	Low–Med

Do not fill this from public leaderboard ranks. Run the same 30–100 golden prompts per candidate with production-like context size.

4. Cost Model Template

Copy this block into your memo; replace symbols with dated vendor prices.

Field	Value (your feature)
Price date / source	Vendor pricing page URL + date
P_in, P_out, P_cache	Primary model; note batch/discount if applicable
T_in, T_out per call (measured)	From staging logs, p50 and p95
C_tool per call	Embeddings, rerank, search, etc.
C_call	Formula from Chapter 5
Monthly at N = 10k / 100k / 1M	Three rows
Blended tier mix	e.g. 70% mid / 25% small / 5% frontier
Stress case	+25% output, −30% cache hit, or 2× agent steps

Add one line: Cost per successful outcome = monthly inference ÷ (calls × success rate). That is the number product and finance should debate.

5. Quality and Eval Template

Section	Include
Golden set	Size, source (prod samples / synthetic), refresh cadence
Metrics	Exact match, rubric score, citation accuracy, human pass rate
Pass threshold	e.g. ≥92% rubric ≥4/5 on stratified sample
Failure taxonomy	Hallucination, wrong policy section, format break, unsafe
HITL	% reviewed pre-send; SLA; feedback loop to eval set
Regression plan	Run eval on every prompt or model change

Reference Chapter 4: public benchmarks inform priors; they do not replace product evals.

Do not ship on vibe scores

“Seemed better in the playground” is not a pass threshold. Name the metric and the cutoff.

6. Risk Register Template

Risk	Likelihood	Impact	Mitigation	Owner
Hallucinated policy citation	Med	High	RAG + cite-or-abstain + human review on send	PM + Eng
PII in logs / prompts	Med	High	Redaction, retention policy, no training on customer data	Security
Vendor outage	Low	High	Fallback endpoint, graceful degradation copy	Eng
Price increase	Med	Med	Tier routing, cache, quarterly cost review	PM
Eval drift in production	Med	Med	Weekly sample audit, user thumbs-down → golden set	PM
Over-automation	Med	High	HITL on denials / payments; clear AI disclosure	PM + Legal

7. Example Memo (Hypothetical): Claims Shortfall Assistant

Fictional insurer workflow for teaching only—not a real product or vendor endorsement.

Executive summary

Decision: Ship a “shortfall explainer” assist for adjusters: mid-tier general model as default, small model for intent routing, frontier only on escalated complex cases. Ground all answers in policy corpus via RAG; no autonomous payment decisions.

Problem

Adjusters spend 12–18 minutes per claim explaining benefit shortfalls to members. Goal: cut draft prep to <5 minutes with adjuster edit before send. Success: ≥85% drafts accepted with minor edits; zero uncited policy claims in audit sample.

Constraints

p95 TTFT < 2.5s; full draft < 20s
English only v1; PHI in approved region only
Monthly inference budget cap: symbolic B (filled by finance)

Shortlist

A: Mid-tier closed API (general instruct model)
B: Alternate mid-tier (strong long context)
C: Small + cascade to mid (lowest cost path)

Selection matrix (excerpt)

Dimension	A	B	C
Golden set rubric (n=80)	4.3	4.4	3.9
Citation accuracy	94%	96%	88%
p95 latency	2.1s	2.8s	1.8s (+ escalate)
C_call at measured tokens	1.0×	1.1×	0.65×
Compliance checklist	Pass	Pass	Pass

Pick: B as primary for citation quality; C as cost-optimized path for simple shortfall types after 4-week pilot if eval gap closes to <2 points.

Architecture

Router (small): classify shortfall type + retrieve top-k policy chunks
Generator (mid): draft letter with mandatory citation slots
Validator: JSON schema + “every paragraph has ≥1 citation id”
HITL: adjuster must approve; model cannot send to member directly

Cost sketch (symbolic)

Measured per draft: T_in ≈ 4,200 (system + RAG), T_out ≈ 650, C_tool = embed + rerank. At N = 100k drafts/month, monthly inference = 100,000 × C_call + C_fixed. Stress: +25% T_out if adjusters ask for “more detail” button without caps.

Risks (excerpt)

Wrong policy year in corpus → mitigated by metadata filter on effective date. Member-facing tone errors → mitigated by template + banned phrases list in system prompt (Module 03 will deepen prompt structure).

Revisit triggers

Citation accuracy < 90% for two consecutive weeks
Monthly cost > 110% of budget B
New mid-tier model with signed BAA and better eval on same golden set

PM takeaway

The example is opinionated: quality and compliance beat lowest C_call, with a documented path to optimize later.

8. Self-Check: 10 Questions Before You Ship the Memo

Answer honestly. Any “no” means the memo is not ready for sign-off.

Can you state the user job and success metric in one sentence each?
Did you run the same golden prompts on every shortlisted candidate?
Is there a numeric pass threshold—not only qualitative “better”?
Did you measure T_in and T_out on production-like prompts, not demos?
Are P_in, P_out, and P_cache dated and sourced?
Do you show monthly cost at 10k, 100k, and 1M (or your real scale)?
Is there a default tier and an explicit escalation tier?
Does the architecture name RAG/tools/HITL where grounding or safety requires it?
Are top three risks mitigated with owners—not only listed?
Did you define revisit triggers (eval drift, cost, incident, new model)?

Capstone complete when

A staff engineer and a finance partner can read your memo and argue specifics—not ask “which model should we use?”

9. Bridge to Module 03 — Prompt Engineering

The memo picks which model and what architecture. Module 03 (Prompt Engineering — Techniques & Structure) picks how you instruct that model reliably: system prompt anatomy, few-shot design, chain-of-thought for hard steps, XML delimiters, and eval rubrics tied to prompt versions.

Carry these forward from Module 02 into prompt work:

Token budget from your cost model → max context and output caps in the PRD
Failure taxonomy from evals → negative instructions and abstain rules
Cacheable system prefix → stable policy block designed for prompt caching
Revisit triggers → prompt version changelog and regression evals

A weak prompt on the right model still fails evals. A strong prompt on the wrong tier still fails margin. You need both—which is why the learning path treats economics and prompting as adjacent modules, not rivals.

Chapter Summary

Artifact	Purpose
Selection matrix	Compare candidates on product-weighted dimensions
Cost block	Symbolic formula + 10k / 100k / 1M scenarios
Quality block	Golden set, metrics, thresholds, HITL
Risk register	Mitigations with owners
Example memo	Template for grounded, regulated workflows
Self-check	10 questions before sign-off
Next module	Prompt engineering for reliable instructions

Closing Thought

Model selection is where PM credibility in AI products is won. Not by naming the trendiest endpoint, but by documenting a choice that still makes sense when usage 10× and the vendor changes the price list.

Finish your memo. Run the self-check. Then move to prompts—the layer that turns a defensible model choice into a defensible user experience.

Module 02 complete

You can compare families, read benchmarks skeptically, model token economics, and sign a selection memo. That is the baseline for every AI feature you ship next.

Chapter navigation

← Previous

Chapter 5: Cost per Token, Latency, and Model Tier Tradeoffs — The PM Version

Unit economics, tiers, routing, caching, and the monthly cost formula.

Read chapter →

Next module →

Module 03: Prompt Engineering — Techniques & Structure

System prompts, CoT, delimiters, chaining, and production eval rubrics for prompts.

Coming Soon

← Chapter 5 Back to Module AI Learning Back to Blog