Chapter 06 · Module 02 · Beginner–Intermediate · 28–32 min

Chapter 6: The Model Selection Memo — The PM Version

Turn landscape, benchmarks, and token economics into one decision document your team and leadership can sign.

Book: AI Learning Capstone 28–32 min
Use case Eval Cost Decision

Module 02 deliverable: one memo that survives engineering, finance, and compliance review

Introduction

Module 02 walked from model families (Chapter 1) through vendor fit (Chapter 2), multimodal scope (Chapter 3), benchmarks (Chapter 4), and unit economics (Chapter 5).

This chapter is the deliverable: a model selection memo you can attach to a PRD, send to engineering, and defend in a review. It is not a slide deck of logos. It is a decision record—what you chose, what you rejected, what it costs at scale, how you will know if quality holds, and what can go wrong.

If the memo does not name a default model, a fallback tier, and a revisit trigger, it is not finished.

PM takeaway

The memo is the contract between product intent and ML ops reality. Write it before you promise a launch date.

1. Why a Memo Exists

Without a memo, teams default to:

  • Whoever shouted loudest in the vendor demo
  • The model the principal engineer already uses
  • “Frontier everywhere” until finance intervenes six months later

A good memo forces alignment on:

Audience | What they need from the memo
Engineering | Default endpoints, tiers, routing, eval gates
Design / CX | Latency budget, confidence UX, human review points
Finance | Cost scenarios at volume N and a symbolic cost formula with dated prices
Legal / compliance | Data residency, retention, prohibited uses
Leadership | One-page recommendation and explicit risks

Length: typically 3–8 pages plus appendices (eval set description, sample prompts). Shorter is fine for a pilot; production needs the full stack.

2. Recommended Memo Structure

  1. Executive summary — Decision in two sentences; default model + tier mix.
  2. Problem & user job — Who uses it, success metric, what “wrong” looks like.
  3. Constraints — Latency, languages, modalities, residency, budget cap.
  4. Candidate shortlist — 2–4 options (families or specific endpoints), not a market map.
  5. Selection matrix — Scored on your dimensions (template below).
  6. Quality plan — Eval set, pass thresholds, human review sample rate.
  7. Cost model — C_call, monthly at 10k / 100k / 1M, sensitivity notes.
  8. Architecture sketch — RAG, tools, routing, cache, HITL—not optional for grounded apps.
  9. Risk register — Hallucination, privacy, vendor lock-in, price change.
  10. Recommendation — Primary, fallback, escalation path.
  11. Revisit triggers — New model release, eval drift, 2× volume, incident.
  12. Open questions — What you will learn in pilot week 1–4.

PM takeaway

Put the recommendation on page one. Appendices hold the evidence; busy reviewers never read page six first.

3. Selection Matrix Template

Score 1–5 or Red / Yellow / Green per candidate. Weight rows by product priority.

Dimension | Weight | Candidate A | Candidate B | Candidate C
Task quality on golden set | High | | |
Structured output reliability | High | | |
Latency (p95 TTFT / total) | Med | | |
Context + RAG fit | Med | | |
Tool / function calling | Med | | |
Safety & refusals (appropriate) | Med | | |
Unit economics at target N | High | | |
Ops: logging, eval hooks, SLAs | Med | | |
Compliance (residency, BAA, etc.) | As needed | | |
Vendor / lock-in risk | Low–Med | | |

Do not fill this from public leaderboard ranks. Run the same 30–100 golden prompts per candidate with production-like context size.
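One way to roll the matrix up into a single comparable number is a weight-normalized average. The sketch below is only an illustration: the mapping of High / Med / Low to 3 / 2 / 1 is an assumption (pick weights that reflect your product priority), and the candidate scores are placeholders, not measurements.

```python
# Hypothetical numeric weights for the matrix's High / Med / Low labels.
WEIGHTS = {"High": 3, "Med": 2, "Low": 1}


def weighted_score(rows, scores):
    """Return the weight-normalized average of 1-5 scores for one candidate.

    rows:   {dimension: weight label}
    scores: {dimension: score from 1 to 5, measured on your golden set}
    """
    missing = set(rows) - set(scores)
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    total_weight = sum(WEIGHTS[label] for label in rows.values())
    weighted = sum(WEIGHTS[rows[dim]] * scores[dim] for dim in rows)
    return weighted / total_weight


rows = {
    "Task quality on golden set": "High",
    "Structured output reliability": "High",
    "Latency (p95)": "Med",
    "Unit economics at target N": "High",
}

# Placeholder scores for illustration only.
candidates = {
    "A": {"Task quality on golden set": 4, "Structured output reliability": 4,
          "Latency (p95)": 5, "Unit economics at target N": 3},
    "B": {"Task quality on golden set": 5, "Structured output reliability": 4,
          "Latency (p95)": 3, "Unit economics at target N": 3},
}

ranking = sorted(candidates,
                 key=lambda name: weighted_score(rows, candidates[name]),
                 reverse=True)
```

A single score hides trade-offs, so it works best as a tiebreaker next to the full matrix. A candidate that fails a hard constraint (residency, for example) should be removed before scoring rather than outvoted by other rows.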

4. Cost Model Template

Copy this block into your memo; replace symbols with dated vendor prices.

Field | Value (your feature)
Price date / source | Vendor pricing page URL + date
P_in, P_out, P_cache | Primary model; note batch/discount if applicable
T_in, T_out per call (measured) | From staging logs, p50 and p95
C_tool per call | Embeddings, rerank, search, etc.
C_call | Formula from Chapter 5
Monthly at N = 10k / 100k / 1M | Three rows
Blended tier mix | e.g. 70% mid / 25% small / 5% frontier
Stress case | +25% output, −30% cache hit, or 2× agent steps

Add one line: Cost per successful outcome = monthly inference ÷ (calls × success rate). That is the number product and finance should debate.
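The cost block can be prototyped in a few lines before it goes into a spreadsheet. This is a minimal sketch under stated assumptions: the per-call formula (fresh input tokens, cached input tokens, output tokens, plus tool cost, with prices quoted per 1M tokens) is an assumed common form, since the Chapter 5 formula is not reproduced here, and every price and token count is a placeholder rather than a vendor quote.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Dated vendor prices per 1M tokens (placeholder values in the example)."""
    p_in: float
    p_out: float
    p_cache: float


@dataclass
class CallProfile:
    """Measured per-call usage from staging logs."""
    t_in: float      # total input tokens per call
    t_out: float     # output tokens per call
    t_cached: float  # input tokens served from the prompt cache (part of t_in)
    c_tool: float    # per-call tool cost: embeddings, rerank, search


def cost_per_call(profile, prices):
    """Assumed form of C_call: token costs plus tool cost."""
    fresh_in = profile.t_in - profile.t_cached
    token_cost = (fresh_in * prices.p_in
                  + profile.t_cached * prices.p_cache
                  + profile.t_out * prices.p_out) / 1_000_000
    return token_cost + profile.c_tool


def monthly_cost(n_calls, c_call, c_fixed=0.0):
    """Monthly inference = N x C_call + C_fixed."""
    return n_calls * c_call + c_fixed


def cost_per_success(monthly, n_calls, success_rate):
    """Monthly inference / (calls x success rate)."""
    if n_calls <= 0 or not 0 < success_rate <= 1:
        raise ValueError("need n_calls > 0 and 0 < success_rate <= 1")
    return monthly / (n_calls * success_rate)


# Placeholder inputs for illustration only.
prices = Prices(p_in=2.0, p_out=8.0, p_cache=0.5)
profile = CallProfile(t_in=4000, t_out=500, t_cached=2000, c_tool=0.001)

c_call = cost_per_call(profile, prices)
scenarios = {n: monthly_cost(n, c_call) for n in (10_000, 100_000, 1_000_000)}
per_success = cost_per_success(scenarios[100_000], 100_000, success_rate=0.8)
```

Rerunning the same functions with a stressed profile (higher `t_out`, lower `t_cached`) gives the stress-case row without touching the formula.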

5. Quality and Eval Template

Section | Include
Golden set | Size, source (prod samples / synthetic), refresh cadence
Metrics | Exact match, rubric score, citation accuracy, human pass rate
Pass threshold | e.g. ≥92% of outputs score ≥4/5 on the rubric, measured on a stratified sample
Failure taxonomy | Hallucination, wrong policy section, format break, unsafe
HITL | % reviewed pre-send; SLA; feedback loop to eval set
Regression plan | Run eval on every prompt or model change

Reference Chapter 4: public benchmarks inform priors; they do not replace product evals.

Do not ship on vibe scores

“Seemed better in the playground” is not a pass threshold. Name the metric and the cutoff.
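Naming the metric and the cutoff can be as simple as the check below. It is a sketch that assumes rubric scores on a 1-5 scale and reuses the example threshold from the template (at least 92% of outputs scoring 4 or higher). The function names are hypothetical, and stratification of the sample is left to whoever assembles the score list.

```python
def pass_rate(rubric_scores, min_score=4):
    """Share of outputs whose rubric score is at least min_score."""
    if not rubric_scores:
        raise ValueError("no scores to evaluate")
    passed = sum(1 for score in rubric_scores if score >= min_score)
    return passed / len(rubric_scores)


def meets_threshold(rubric_scores, min_score=4, required_rate=0.92):
    """True when the pass rate reaches the cutoff written in the memo."""
    return pass_rate(rubric_scores, min_score) >= required_rate
```

For example, `meets_threshold([5] * 46 + [3] * 4)` evaluates 46 passing outputs out of 50, which sits exactly on the 92% cutoff and therefore passes. Running the same check on every prompt or model change is what turns it into a regression gate.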

6. Risk Register Template

Risk | Likelihood | Impact | Mitigation | Owner
Hallucinated policy citation | Med | High | RAG + cite-or-abstain + human review on send | PM + Eng
PII in logs / prompts | Med | High | Redaction, retention policy, no training on customer data | Security
Vendor outage | Low | High | Fallback endpoint, graceful degradation copy | Eng
Price increase | Med | Med | Tier routing, cache, quarterly cost review | PM
Eval drift in production | Med | Med | Weekly sample audit, user thumbs-down → golden set | PM
Over-automation | Med | High | HITL on denials / payments; clear AI disclosure | PM + Legal

7. Example Memo (Hypothetical): Claims Shortfall Assistant

Fictional insurer workflow for teaching only—not a real product or vendor endorsement.

Executive summary

Decision: Ship a “shortfall explainer” assist for adjusters: mid-tier general model as default, small model for intent routing, frontier only on escalated complex cases. Ground all answers in policy corpus via RAG; no autonomous payment decisions.

Problem

Adjusters spend 12–18 minutes per claim explaining benefit shortfalls to members. Goal: cut draft prep to <5 minutes with adjuster edit before send. Success: ≥85% drafts accepted with minor edits; zero uncited policy claims in audit sample.

Constraints

  • p95 TTFT < 2.5s; full draft < 20s
  • English only v1; PHI in approved region only
  • Monthly inference budget cap: symbolic B (filled by finance)

Shortlist

  • A: Mid-tier closed API (general instruct model)
  • B: Alternate mid-tier (strong long context)
  • C: Small + cascade to mid (lowest cost path)

Selection matrix (excerpt)

Dimension | A | B | C
Golden set rubric (n=80) | 4.3 | 4.4 | 3.9
Citation accuracy | 94% | 96% | 88%
p95 latency | 2.1s | 2.8s | 1.8s (+ escalate)
C_call at measured tokens | 1.0× | 1.1× | 0.65×
Compliance checklist | Pass | Pass | Pass

Pick: B as primary for citation quality; C as the cost-optimized path for simple shortfall types after the 4-week pilot, if its citation-accuracy gap to B closes to <2 percentage points.

Architecture

  • Router (small): classify shortfall type + retrieve top-k policy chunks
  • Generator (mid): draft letter with mandatory citation slots
  • Validator: JSON schema + “every paragraph has ≥1 citation id”
  • HITL: adjuster must approve; model cannot send to member directly
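The validator step above can be sketched with the standard library alone. The draft shape used here (a `paragraphs` list whose items carry `text` and `citation_ids`) is a hypothetical format chosen for illustration, and the hand-written structure check stands in for a real JSON Schema validator. Checking cited ids against the retrieved chunk ids is an added assumption in the spirit of cite-or-abstain.

```python
import json


def validate_draft(raw_json, retrieved_ids):
    """Return a list of problems; an empty list means the draft passes.

    raw_json:      model output, expected to be a JSON object (hypothetical shape)
    retrieved_ids: set of policy chunk ids returned by the retriever
    """
    try:
        draft = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    paragraphs = draft.get("paragraphs") if isinstance(draft, dict) else None
    if not isinstance(paragraphs, list) or not paragraphs:
        return ["missing or empty 'paragraphs' list"]

    problems = []
    for i, para in enumerate(paragraphs):
        if not isinstance(para, dict) or not isinstance(para.get("text"), str):
            problems.append(f"paragraph {i}: missing text")
            continue
        ids = para.get("citation_ids")
        if not isinstance(ids, list) or not ids:
            problems.append(f"paragraph {i}: no citation id")
            continue
        # A citation only counts if it points at a chunk that was retrieved.
        unknown = [c for c in ids
                   if not isinstance(c, str) or c not in retrieved_ids]
        if unknown:
            problems.append(f"paragraph {i}: unknown citation ids {unknown}")
    return problems
```

A draft with any reported problem would be regenerated or routed to the adjuster with the failures highlighted; either way the approval step stays with the human.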

Cost sketch (symbolic)

Measured per draft: T_in ≈ 4,200 (system + RAG), T_out ≈ 650, C_tool = embed + rerank. At N = 100k drafts/month, monthly inference = 100,000 × C_call + C_fixed. Stress: +25% T_out if adjusters use a “more detail” button with no output cap.
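The stress case can be worked through symbolically. The lines below assume output cost is linear in output tokens at a per-token price P_out (an assumption, since the Chapter 5 formula is not reproduced here) and that only output length changes.

```latex
\begin{aligned}
C_{\text{month}} &= N \cdot C_{\text{call}} + C_{\text{fixed}}, \qquad N = 100{,}000 \\
\Delta T_{\text{out}} &= 0.25 \times 650 = 162.5 \ \text{tokens per draft} \\
\Delta C_{\text{month}} &= N \cdot \Delta T_{\text{out}} \cdot P_{\text{out}}
  = 100{,}000 \times 162.5 \times P_{\text{out}}
  = 1.625 \times 10^{7} \cdot P_{\text{out}}
\end{aligned}
```

Under these assumptions the uncapped button adds about 16.25 million output tokens per month, priced at whatever dated P_out the memo records.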

Risks (excerpt)

Wrong policy year in corpus → mitigated by metadata filter on effective date. Member-facing tone errors → mitigated by template + banned phrases list in system prompt (Module 03 will deepen prompt structure).

Revisit triggers

  • Citation accuracy < 90% for two consecutive weeks
  • Monthly cost > 110% of budget B
  • New mid-tier model with signed BAA and better eval on same golden set
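The two numeric triggers above are easy to automate in a weekly report. The sketch below is illustrative: the function and parameter names are hypothetical, accuracy values are fractions between 0 and 1, and only the most recent two weeks are checked. The third trigger (a new model with a signed BAA and better eval results) needs a human decision and is not encoded.

```python
def fired_triggers(weekly_citation_accuracy, monthly_cost, budget,
                   accuracy_floor=0.90, cost_ratio=1.10):
    """Return descriptions of the revisit triggers that currently fire.

    weekly_citation_accuracy: weekly audit results, oldest first
    monthly_cost, budget:     same currency; budget is the memo's B
    """
    fired = []

    last_two = weekly_citation_accuracy[-2:]
    if len(last_two) == 2 and all(acc < accuracy_floor for acc in last_two):
        fired.append(
            f"citation accuracy below {accuracy_floor:.0%} "
            "for two consecutive weeks"
        )

    if monthly_cost > cost_ratio * budget:
        fired.append(f"monthly cost above {cost_ratio:.0%} of budget")

    return fired
```

An empty list means no revisit is due; anything else should reopen the memo rather than trigger an automatic model switch.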

PM takeaway

The example is opinionated: quality and compliance beat lowest C_call, with a documented path to optimize later.

8. Self-Check: 10 Questions Before You Ship the Memo

Answer honestly. Any “no” means the memo is not ready for sign-off.

  1. Can you state the user job and success metric in one sentence each?
  2. Did you run the same golden prompts on every shortlisted candidate?
  3. Is there a numeric pass threshold—not only qualitative “better”?
  4. Did you measure T_in and T_out on production-like prompts, not demos?
  5. Are P_in, P_out, and P_cache dated and sourced?
  6. Do you show monthly cost at 10k, 100k, and 1M (or your real scale)?
  7. Is there a default tier and an explicit escalation tier?
  8. Does the architecture name RAG/tools/HITL where grounding or safety requires it?
  9. Are top three risks mitigated with owners—not only listed?
  10. Did you define revisit triggers (eval drift, cost, incident, new model)?

Capstone complete when

A staff engineer and a finance partner can read your memo and argue specifics—not ask “which model should we use?”

9. Bridge to Module 03 — Prompt Engineering

The memo picks which model and what architecture. Module 03 (Prompt Engineering — Techniques & Structure) picks how you instruct that model reliably: system prompt anatomy, few-shot design, chain-of-thought for hard steps, XML delimiters, and eval rubrics tied to prompt versions.

Carry these forward from Module 02 into prompt work:

  • Token budget from your cost model → max context and output caps in the PRD
  • Failure taxonomy from evals → negative instructions and abstain rules
  • Cacheable system prefix → stable policy block designed for prompt caching
  • Revisit triggers → prompt version changelog and regression evals

A weak prompt on the right model still fails evals. A strong prompt on the wrong tier still fails margin. You need both—which is why the learning path treats economics and prompting as adjacent modules, not rivals.

Chapter Summary

Artifact | Purpose
Selection matrix | Compare candidates on product-weighted dimensions
Cost block | Symbolic formula + 10k / 100k / 1M scenarios
Quality block | Golden set, metrics, thresholds, HITL
Risk register | Mitigations with owners
Example memo | Template for grounded, regulated workflows
Self-check | 10 questions before sign-off
Next module | Prompt engineering for reliable instructions

Closing Thought

Model selection is where PM credibility in AI products is won. Not by naming the trendiest endpoint, but by documenting a choice that still makes sense when usage grows 10× and the vendor changes the price list.

Finish your memo. Run the self-check. Then move to prompts—the layer that turns a defensible model choice into a defensible user experience.

Module 02 complete

You can compare families, read benchmarks skeptically, model token economics, and sign a selection memo. That is the baseline for every AI feature you ship next.
