Introduction
In Module 01, Chapter 11, we closed the mechanics loop: cutoffs, private knowledge, reasoning limits, and why the product stack must own truth — not the model alone.
This module shifts from how models fail to which models you pick and what that choice costs. You do not need a favorite lab. You need a decision framework that survives the next release cycle.
Compare models like infrastructure choices — not brand fandom.
The simple PM version
Use case → Constraints → Eval → Tier → Vendor path.
The “best model” headline is marketing; your memo is engineering plus economics.
1. Why PMs Need a Model Landscape Map
Engineers read release notes. Leadership reads headlines. Finance reads invoices. PMs sit in the middle — translating capability into scope, cost into roadmap, and risk into UX.
Without a landscape map, teams default to:
- whatever the last demo used,
- whatever procurement already signed,
- whatever scored highest on a benchmark that resembles nothing your users actually do.
A landscape map is not a vendor brochure. It is a taxonomy: who builds foundation models, how they are licensed, what tiers exist, and which dimensions actually move your product.
PM takeaway
Your job is to name the decision dimensions before the sprint picks a model ID from habit.
2. Foundation Model Families — The PM View
A family is a line of models from one builder with a shared training philosophy, safety posture, and API shape. Individual model names change with almost every release; families are the stable mental buckets.
| Family | Builder lineage (PM shorthand) | Typical product association |
|---|---|---|
| GPT | OpenAI | Broad general assistant, tool-rich APIs, enterprise adoption |
| Claude | Anthropic | Long documents, careful tone, enterprise policy workflows |
| Gemini | Google | Workspace-adjacent products, multimodal in Google stack |
| Llama | Meta (open weights) | Self-host, customize, air-gapped or cost-controlled inference |
| Mistral | Mistral AI | Efficient open and hosted options, EU footprint considerations |
Families overlap in capability. Differences show up in how you deploy: hosted API vs your VPC, default refusals, context packaging, tool schemas, and commercial terms — not in who “wins” a generic IQ contest.
PM takeaway
Learn families as deployment and policy buckets first; benchmark scores second.
3. Closed APIs vs Open-Weight Models
Closed API models run on the provider’s stack. You get fast iteration, managed safety layers, and provider-backed SLAs, but the data-handling terms and per-token pricing are set by the provider, not by you.
Open-weight models ship weights you can host. You gain deployment control, fine-tuning freedom, and potentially lower marginal cost at scale — with engineering ownership for inference, patching, evals, and guardrails.
| Dimension | Closed API | Open-weight (self/hosted) |
|---|---|---|
| Time to first prototype | Usually faster | Slower — infra required |
| Data residency | Contract + region choices | You define boundary |
| Customization | Prompting, tools, limited fine-tune | Fine-tune, distill, route locally |
| Cost at huge volume | Enterprise discounts are negotiable; still billed per token | CapEx/OpEx tradeoff; unit cost can fall |
| Upgrade path | Provider deprecates models | You own migration testing |
PM takeaway
Open-weight is not “free.” It shifts spend from API bills to platform teams, GPUs, and reliability work.
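To make that cost shift concrete, here is a minimal break-even sketch. It assumes a deliberately simple model: API spend is purely per-token, while self-hosting adds a fixed monthly platform cost (GPUs, on-call, eval upkeep) to a lower marginal cost. The function names and every dollar figure are hypothetical placeholders, not real vendor prices.

```python
from typing import Optional


def api_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Closed-API spend: scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_host_monthly_cost(fixed_usd_per_month: float,
                           tokens_per_month: float,
                           marginal_usd_per_million_tokens: float) -> float:
    """Self-hosted spend: fixed platform cost plus a smaller per-token cost."""
    return fixed_usd_per_month + tokens_per_month / 1_000_000 * marginal_usd_per_million_tokens


def break_even_tokens(fixed_usd_per_month: float,
                      api_usd_per_million_tokens: float,
                      marginal_usd_per_million_tokens: float) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper.

    Returns None when self-hosting never wins (no per-token saving).
    """
    saving_per_million = api_usd_per_million_tokens - marginal_usd_per_million_tokens
    if saving_per_million <= 0:
        return None
    return fixed_usd_per_month / saving_per_million * 1_000_000


# Hypothetical inputs for illustration only.
FIXED_PLATFORM_USD = 40_000.0   # assumed monthly GPU + staffing cost
API_USD_PER_MILLION = 5.0       # assumed blended API price
SELF_HOST_USD_PER_MILLION = 1.0 # assumed marginal self-host cost

print(break_even_tokens(FIXED_PLATFORM_USD, API_USD_PER_MILLION, SELF_HOST_USD_PER_MILLION))
# 10000000000.0 (10 billion tokens per month under these assumptions)
```

Below the break-even volume, the API bill is the cheaper line item; above it, the platform investment starts to pay back. Real memos should replace the placeholders with negotiated prices and loaded engineering cost.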
4. Frontier, Mid, and Small Tiers
Providers ship multiple sizes per family. PMs should think in tiers, not a single model name:
- Frontier — hardest tasks, highest cost/latency, widest modality; use sparingly on critical paths.
- Mid — default production tier for most user-facing reasoning with guardrails.
- Small — classification, routing, extraction, high-volume chat turns; pairs well with cascade architectures.
Tier strategy is a product decision: route easy work to small models, escalate on uncertainty signals, and cap frontier usage with budgets and feature flags.
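One way to picture that routing policy is the sketch below. It assumes each model call returns some confidence signal between 0 and 1; how that signal is produced is left open. The task names, the threshold, the per-call cost, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

EASY_TASKS = {"classification", "routing", "extraction"}  # assumed small-tier work
ESCALATION_THRESHOLD = 0.7  # hypothetical confidence cutoff


@dataclass
class FrontierBudget:
    """Tracks frontier spend against a cap (figures are placeholders)."""
    cap_usd: float
    spent_usd: float = 0.0

    def can_afford(self, cost_usd: float) -> bool:
        return self.spent_usd + cost_usd <= self.cap_usd


def starting_tier(task_kind: str) -> str:
    """Route easy work to the small tier; everything else starts at mid."""
    return "small" if task_kind in EASY_TASKS else "mid"


def next_tier(current: str, confidence: float, budget: FrontierBudget,
              frontier_enabled: bool = True,
              frontier_cost_usd: float = 0.05) -> Optional[str]:
    """Return the tier to escalate to, or None to accept the current answer."""
    if confidence >= ESCALATION_THRESHOLD:
        return None  # confident enough, stop here
    if current == "small":
        return "mid"
    if current == "mid" and frontier_enabled and budget.can_afford(frontier_cost_usd):
        budget.spent_usd += frontier_cost_usd  # reserve spend before the call
        return "frontier"
    return None  # flag off or budget exhausted: stay put and handle in UX
```

The `frontier_enabled` argument stands in for a feature flag, and `FrontierBudget` for the spend cap. When either blocks escalation, the product needs a defined fallback (for example, a "needs review" state) rather than a silent failure.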
PM takeaway
Spec tier routing in the PRD — not “we’ll use the big model everywhere and optimize later.”
5. General vs Specialized Models
General models aim for broad language, reasoning, and tool use across domains. Specialized models target a modality or task: code completion, embeddings, rerankers, speech, OCR, medical/legal tuned variants.
| When general wins | When specialized wins |
|---|---|
| Multi-step assistant with tools | Embedding search at billion-scale |
| Policy Q&A with RAG | On-device wake-word classifiers |
| Draft-and-edit copilots | Deterministic code completion in IDE |
Anti-pattern: forcing a general frontier model to do cheap perception work (e.g., simple layout OCR) because it is the only model ID in the repo.
PM takeaway
Map each workflow step to the cheapest competent model type — general or specialized.
6. PM Comparison Table — Dimensions That Matter
Do not maintain a static “winner” column. Maintain fit against your constraints. Refresh cells when you run evals — not when a blog post drops.
| Dimension | What to ask | Why PMs care |
|---|---|---|
| Task quality | Pass rate on your eval set? | Defines MVP bar and regression gates |
| Context | Enough window for real prompts + RAG? | Affects architecture from Chapter 9 |
| Latency | p95 at target output length? | UX and agent loop cost |
| Cost | $ per successful outcome at projected volume? | Unit economics and tier routing |
| Safety / refusal | False refusals vs leaks on your policy? | Support load and compliance |
| Tool use | Reliable function calling for your schemas? | Agent features live or die here |
| Data terms | Training opt-out, retention, regions? | Legal and enterprise sales |
| Operability | Observability, versioning, fallbacks? | Incident response and A/B tests |
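Two of the cells above reduce to simple arithmetic. The sketch below shows one way to compute them, assuming you already log per-run cost, pass/fail results, and latencies from your own eval harness. The function names are hypothetical, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
from typing import Sequence


def cost_per_successful_outcome(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by the number of runs that passed your eval bar."""
    if successes <= 0:
        raise ValueError("no successful outcomes: cost per success is undefined")
    return total_cost_usd / successes


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of observed latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Dividing by successes rather than by calls is the point: a cheaper model with a low pass rate can cost more per useful answer than a pricier one that passes more often.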
PM takeaway
Publish this table inside your model selection memo — empty cells are risks you have not measured yet.
7. Competitive Landscape — Builders, Hosters, and Integrators
The market has three layers PMs should separate:
- Foundation builders — train base models (OpenAI, Anthropic, Google, Meta, Mistral, others).
- Hosters / clouds — sell inference endpoints, GPUs, and managed fine-tuning (hyperscalers, Together, Fireworks, etc.).
- Integrators — frameworks and gateways (LangChain, LlamaIndex, LiteLLM, internal platforms) that abstract providers.
Your product rarely commits to only one layer. You might call Anthropic on Bedrock, route Llama on your cluster, and log everything through an internal gateway. PM value is making those layers explicit in architecture diagrams and cost attribution.
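As an illustration of making the layers explicit, the sketch below tags each call with its inference path and rolls cost up by SLA owner. The record shape, field names, and the owner labels are assumptions for this example; a real gateway would populate them from its own logs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class InferencePath:
    """Which layer does what for one kind of call (hypothetical schema)."""
    builder: str
    hoster: str
    integrator: str
    sla_owner: str


@dataclass(frozen=True)
class Call:
    path: InferencePath
    cost_usd: float


def cost_by_sla_owner(calls: Iterable[Call]) -> Dict[str, float]:
    """Sum spend per SLA owner so cost and accountability line up."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.path.sla_owner] += call.cost_usd
    return dict(totals)


# Paths mirroring the example in the text; the owner labels are assumed.
hosted = InferencePath("Anthropic", "Bedrock", "internal gateway", "hoster")
self_run = InferencePath("Meta (Llama)", "own cluster", "internal gateway", "platform team")

calls = [Call(hosted, 0.25), Call(self_run, 0.5), Call(hosted, 0.25)]
print(cost_by_sla_owner(calls))
```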
PM takeaway
Ask “who owns the SLA for this call?” — builder, hoster, or your platform team.
8. The Wrong Question: “What Is the Best Model?”
Best for whom, on what task, under which constraints, measured how?
Public leaderboards answer benchmark authors’ tasks. Your users bring messy PDFs, acronyms, tool schemas, and compliance language. A model that tops a multiple-choice science suite may still fail your prior-auth letter workflow.
Replace “best” with:
- Sufficient — meets quality bar with acceptable cost/latency.
- Replaceable — abstraction layer so you can swap providers.
- Observable — you can detect regressions when versions change.
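The "replaceable" and "observable" properties can be sketched as a thin gateway in front of provider adapters. This is a minimal illustration, not any particular framework's API: `ChatProvider`, `StubProvider`, and `Gateway` are hypothetical names, and the stub stands in for a real vendor SDK call.

```python
from typing import Dict, List, Protocol


class ChatProvider(Protocol):
    """Hypothetical minimal interface every provider adapter must satisfy."""

    def complete(self, prompt: str) -> str:
        ...


class StubProvider:
    """Stand-in adapter; a real one would wrap a vendor SDK or HTTP call."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def complete(self, prompt: str) -> str:
        return f"[{self.model_id}] reply to: {prompt}"


class Gateway:
    """Routes calls to the active provider and records who served each one."""

    def __init__(self, providers: Dict[str, ChatProvider], active: str) -> None:
        if active not in providers:
            raise KeyError(active)
        self.providers = providers
        self.active = active
        self.call_log: List[Dict[str, str]] = []

    def switch(self, name: str) -> None:
        """Replaceable: swapping providers is a config change, not a rewrite."""
        if name not in self.providers:
            raise KeyError(name)
        self.active = name

    def complete(self, prompt: str) -> str:
        text = self.providers[self.active].complete(prompt)
        # Observable: log the serving provider so regressions can be traced.
        self.call_log.append({"provider": self.active, "prompt": prompt})
        return text
```

Product code calls `Gateway.complete` and never imports a vendor SDK directly, which keeps a provider swap to one `switch` call plus an eval run.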
PM takeaway
Ban the word “best” in steering meetings. Require “sufficient for workflow X with evidence Y.”
9. Qualities by Use Case — Fit Beats Fame
| Use case pattern | Prioritize | Deprioritize |
|---|---|---|
| High-stakes Q&A with citations | Grounding, refusal calibration, audit logs | Creative temperature |
| Agent with many tool steps | Function-call reliability, latency at medium context | Poetic prose quality |
| High-volume triage | Small-tier cost, consistent JSON | Frontier reasoning |
| Document-heavy review | Effective context + retrieval strategy | Single-shot “read everything” myths |
| Regulated self-host | Data boundary, patch process, eval ownership | Day-one benchmark rank |
PM takeaway
Write use-case-specific acceptance criteria before comparing model names in a spreadsheet.
10. New Model Announcement Checklist
When marketing drops a new flagship name, run this checklist before rewriting the roadmap:
- What tier does it replace — frontier, mid, or small?
- Context window and modality — does it change RAG vs long-context assumptions?
- Pricing model — input vs output vs cached tokens; any batch endpoints?
- Safety delta — more refusals or fewer on your policy eval?
- Breaking changes — deprecated endpoints, new tool schema, different JSON modes?
- Your eval suite — schedule regression on top 5 production workflows, not a demo prompt.
- Rollback — can you flip a feature flag to the prior model ID in one deploy?
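The last two checklist items can be wired together as a regression gate in front of a feature flag. The sketch below assumes you have per-workflow pass rates from your own eval suite for both the pinned model and the candidate. The workflow names, model IDs, and the tolerated drop are hypothetical.

```python
from typing import Dict, List


def regression_failures(baseline: Dict[str, float],
                        candidate: Dict[str, float],
                        max_drop: float = 0.02) -> List[str]:
    """Workflows where the candidate is missing or drops more than max_drop."""
    failures = []
    for workflow, base_rate in baseline.items():
        new_rate = candidate.get(workflow)
        if new_rate is None or base_rate - new_rate > max_drop:
            failures.append(workflow)
    return failures


def active_model_id(pinned_id: str, candidate_id: str,
                    flag_enabled: bool, failures: List[str]) -> str:
    """Serve the candidate only if the flag is on and the gate is clean."""
    if flag_enabled and not failures:
        return candidate_id
    return pinned_id  # rollback path: the prior pinned ID stays available


# Hypothetical pass rates from an internal eval run.
baseline = {"prior_auth_letter": 0.90, "triage": 0.80}
candidate = {"prior_auth_letter": 0.85, "triage": 0.82}

failures = regression_failures(baseline, candidate)
print(failures)  # ['prior_auth_letter']
print(active_model_id("model-v1", "model-v2", True, failures))  # model-v1
```

Turning the flag off returns traffic to the pinned ID without touching the eval results, which is what makes the rollback a single deploy.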
PM takeaway
Treat launches as migration projects with rollback — not as automatic upgrades.
11. Common PM Mistakes
| Mistake | Why it hurts | Instead |
|---|---|---|
| Single-model mandate | No routing; runaway cost | Tier + cascade in architecture |
| Benchmark-driven roadmap | Optimizes tasks users never do | Product eval harness first |
| Ignoring data terms | Deal-blocker late in enterprise cycle | Legal review in selection memo |
| Confusing family with deployment | “We use Llama” but only via one vendor API | Document actual inference path |
| No version pin | Silent behavior drift in production | Pin model IDs; monitor regressions |
Chapter Summary
| Concept | PM understanding |
|---|---|
| Landscape map | Taxonomy of families, tiers, and deployment paths |
| Families | GPT, Claude, Gemini, Llama, Mistral as buckets — not religions |
| Closed vs open | Speed vs control; cost shifts to platform when self-hosting |
| Tiers | Frontier / mid / small for routing and economics |
| General vs specialized | Right model type per workflow step |
| “Best model” | Wrong question — sufficient, replaceable, observable |
| Announcements | Checklist + eval + rollback before hype upgrades |
Closing Thought
The landscape will keep adding names. Your advantage is a stable decision language: constraints, evals, tiers, and deployment paths. Chapter 2 applies that language to a hands-on comparison of GPT, Claude, Gemini, and Llama on your own prompts.
The real PM lesson
Model choice is portfolio management — not picking a winner on social media.