Introduction
Module 02 walked from model families ( Chapter 1 ) through vendor fit ( Chapter 2 ), multimodal scope (Chapter 3), benchmarks (Chapter 4), and unit economics ( Chapter 5 ).
This chapter is the deliverable: a model selection memo you can attach to a PRD, send to engineering, and defend in a review. It is not a slide deck of logos. It is a decision record—what you chose, what you rejected, what it costs at scale, how you will know if quality holds, and what can go wrong.
If the memo does not name a default model, a fallback tier, and a revisit trigger, it is not finished.
PM takeaway
The memo is the contract between product intent and ML ops reality. Write it before you promise a launch date.
1. Why a Memo Exists
Without a memo, teams default to:
- Whoever shouted loudest in the vendor demo
- The model the principal engineer already uses
- “Frontier everywhere” until finance intervenes six months later
A good memo forces alignment on:
| Audience | What they need from the memo |
|---|---|
| Engineering | Default endpoints, tiers, routing, eval gates |
| Design / CX | Latency budget, confidence UX, human review points |
| Finance | N scenarios and symbolic cost formula with dated prices |
| Legal / compliance | Data residency, retention, prohibited uses |
| Leadership | One-page recommendation and explicit risks |
Length: typically 3–8 pages plus appendices (eval set description, sample prompts). Shorter is fine for a pilot; production needs the full stack.
2. Recommended Memo Structure
- Executive summary — Decision in two sentences; default model + tier mix.
- Problem & user job — Who uses it, success metric, what “wrong” looks like.
- Constraints — Latency, languages, modalities, residency, budget cap.
- Candidate shortlist — 2–4 options (families or specific endpoints), not a market map.
- Selection matrix — Scored on your dimensions (template below).
- Quality plan — Eval set, pass thresholds, human review sample rate.
- Cost model — Ccall, monthly at 10k / 100k / 1M, sensitivity notes.
- Architecture sketch — RAG, tools, routing, cache, HITL—not optional for grounded apps.
- Risk register — Hallucination, privacy, vendor lock-in, price change.
- Recommendation — Primary, fallback, escalation path.
- Revisit triggers — New model release, eval drift, 2× volume, incident.
- Open questions — What you will learn in pilot week 1–4.
PM takeaway
Put the recommendation on page one. Appendices hold the evidence; busy reviewers never read page six first.
3. Selection Matrix Template
Score 1–5 or Red / Yellow / Green per candidate. Weight rows by product priority.
| Dimension | Weight | Candidate A | Candidate B | Candidate C |
|---|---|---|---|---|
| Task quality on golden set | High | |||
| Structured output reliability | High | |||
| Latency (p95 TTFT / total) | Med | |||
| Context + RAG fit | Med | |||
| Tool / function calling | Med | |||
| Safety & refusals (appropriate) | Med | |||
| Unit economics at target N | High | |||
| Ops: logging, eval hooks, SLAs | Med | |||
| Compliance (residency, BAA, etc.) | As needed | |||
| Vendor / lock-in risk | Low–Med |
Do not fill this from public leaderboard ranks. Run the same 30–100 golden prompts per candidate with production-like context size.
4. Cost Model Template
Copy this block into your memo; replace symbols with dated vendor prices.
| Field | Value (your feature) |
|---|---|
| Price date / source | Vendor pricing page URL + date |
| Pin, Pout, Pcache | Primary model; note batch/discount if applicable |
| Tin, Tout per call (measured) | From staging logs, p50 and p95 |
| Ctool per call | Embeddings, rerank, search, etc. |
| Ccall | Formula from Chapter 5 |
| Monthly at N = 10k / 100k / 1M | Three rows |
| Blended tier mix | e.g. 70% mid / 25% small / 5% frontier |
| Stress case | +25% output, −30% cache hit, or 2× agent steps |
Add one line: Cost per successful outcome = monthly inference ÷ (calls × success rate). That is the number product and finance should debate.
5. Quality and Eval Template
| Section | Include |
|---|---|
| Golden set | Size, source (prod samples / synthetic), refresh cadence |
| Metrics | Exact match, rubric score, citation accuracy, human pass rate |
| Pass threshold | e.g. ≥92% rubric ≥4/5 on stratified sample |
| Failure taxonomy | Hallucination, wrong policy section, format break, unsafe |
| HITL | % reviewed pre-send; SLA; feedback loop to eval set |
| Regression plan | Run eval on every prompt or model change |
Reference Chapter 4: public benchmarks inform priors; they do not replace product evals.
Do not ship on vibe scores
“Seemed better in the playground” is not a pass threshold. Name the metric and the cutoff.
6. Risk Register Template
| Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|
| Hallucinated policy citation | Med | High | RAG + cite-or-abstain + human review on send | PM + Eng |
| PII in logs / prompts | Med | High | Redaction, retention policy, no training on customer data | Security |
| Vendor outage | Low | High | Fallback endpoint, graceful degradation copy | Eng |
| Price increase | Med | Med | Tier routing, cache, quarterly cost review | PM |
| Eval drift in production | Med | Med | Weekly sample audit, user thumbs-down → golden set | PM |
| Over-automation | Med | High | HITL on denials / payments; clear AI disclosure | PM + Legal |
7. Example Memo (Hypothetical): Claims Shortfall Assistant
Fictional insurer workflow for teaching only—not a real product or vendor endorsement.
Executive summary
Decision: Ship a “shortfall explainer” assist for adjusters: mid-tier general model as default, small model for intent routing, frontier only on escalated complex cases. Ground all answers in policy corpus via RAG; no autonomous payment decisions.
Problem
Adjusters spend 12–18 minutes per claim explaining benefit shortfalls to members. Goal: cut draft prep to <5 minutes with adjuster edit before send. Success: ≥85% drafts accepted with minor edits; zero uncited policy claims in audit sample.
Constraints
- p95 TTFT < 2.5s; full draft < 20s
- English only v1; PHI in approved region only
- Monthly inference budget cap: symbolic B (filled by finance)
Shortlist
- A: Mid-tier closed API (general instruct model)
- B: Alternate mid-tier (strong long context)
- C: Small + cascade to mid (lowest cost path)
Selection matrix (excerpt)
| Dimension | A | B | C |
|---|---|---|---|
| Golden set rubric (n=80) | 4.3 | 4.4 | 3.9 |
| Citation accuracy | 94% | 96% | 88% |
| p95 latency | 2.1s | 2.8s | 1.8s (+ escalate) |
| Ccall at measured tokens | 1.0× | 1.1× | 0.65× |
| Compliance checklist | Pass | Pass | Pass |
Pick: B as primary for citation quality; C as cost-optimized path for simple shortfall types after 4-week pilot if eval gap closes to <2 points.
Architecture
- Router (small): classify shortfall type + retrieve top-k policy chunks
- Generator (mid): draft letter with mandatory citation slots
- Validator: JSON schema + “every paragraph has ≥1 citation id”
- HITL: adjuster must approve; model cannot send to member directly
Cost sketch (symbolic)
Measured per draft: Tin ≈ 4,200 (system + RAG), Tout ≈ 650, Ctool = embed + rerank. At N = 100k drafts/month, monthly inference = 100,000 × Ccall + Cfixed. Stress: +25% Tout if adjusters ask for “more detail” button without caps.
Risks (excerpt)
Wrong policy year in corpus → mitigated by metadata filter on effective date. Member-facing tone errors → mitigated by template + banned phrases list in system prompt (Module 03 will deepen prompt structure).
Revisit triggers
- Citation accuracy < 90% for two consecutive weeks
- Monthly cost > 110% of budget B
- New mid-tier model with signed BAA and better eval on same golden set
PM takeaway
The example is opinionated: quality and compliance beat lowest Ccall, with a documented path to optimize later.
8. Self-Check: 10 Questions Before You Ship the Memo
Answer honestly. Any “no” means the memo is not ready for sign-off.
- Can you state the user job and success metric in one sentence each?
- Did you run the same golden prompts on every shortlisted candidate?
- Is there a numeric pass threshold—not only qualitative “better”?
- Did you measure Tin and Tout on production-like prompts, not demos?
- Are Pin, Pout, and Pcache dated and sourced?
- Do you show monthly cost at 10k, 100k, and 1M (or your real scale)?
- Is there a default tier and an explicit escalation tier?
- Does the architecture name RAG/tools/HITL where grounding or safety requires it?
- Are top three risks mitigated with owners—not only listed?
- Did you define revisit triggers (eval drift, cost, incident, new model)?
Capstone complete when
A staff engineer and a finance partner can read your memo and argue specifics—not ask “which model should we use?”
9. Bridge to Module 03 — Prompt Engineering
The memo picks which model and what architecture. Module 03 (Prompt Engineering — Techniques & Structure) picks how you instruct that model reliably: system prompt anatomy, few-shot design, chain-of-thought for hard steps, XML delimiters, and eval rubrics tied to prompt versions.
Carry these forward from Module 02 into prompt work:
- Token budget from your cost model → max context and output caps in the PRD
- Failure taxonomy from evals → negative instructions and abstain rules
- Cacheable system prefix → stable policy block designed for prompt caching
- Revisit triggers → prompt version changelog and regression evals
A weak prompt on the right model still fails evals. A strong prompt on the wrong tier still fails margin. You need both—which is why the learning path treats economics and prompting as adjacent modules, not rivals.
Chapter Summary
| Artifact | Purpose |
|---|---|
| Selection matrix | Compare candidates on product-weighted dimensions |
| Cost block | Symbolic formula + 10k / 100k / 1M scenarios |
| Quality block | Golden set, metrics, thresholds, HITL |
| Risk register | Mitigations with owners |
| Example memo | Template for grounded, regulated workflows |
| Self-check | 10 questions before sign-off |
| Next module | Prompt engineering for reliable instructions |
Closing Thought
Model selection is where PM credibility in AI products is won. Not by naming the trendiest endpoint, but by documenting a choice that still makes sense when usage 10× and the vendor changes the price list.
Finish your memo. Run the self-check. Then move to prompts—the layer that turns a defensible model choice into a defensible user experience.
Module 02 complete
You can compare families, read benchmarks skeptically, model token economics, and sign a selection memo. That is the baseline for every AI feature you ship next.