Chapter 1: Model Families and the Competitive Landscape

Introduction

In Module 01, Chapter 11, we closed the mechanics loop: knowledge cutoffs, private knowledge, reasoning limits, and why the product stack, not the model alone, must own truth.

This module shifts from how models fail to which models you pick and what that choice costs. You do not need a favorite lab. You need a decision framework that survives the next release cycle.

Compare models like infrastructure choices — not brand fandom.

The simple PM version

Use case → Constraints → Eval → Tier → Vendor path.
The “best model” headline is marketing; your memo is engineering plus economics.

1. Why PMs Need a Model Landscape Map

Engineers read release notes. Leadership reads headlines. Finance reads invoices. PMs sit in the middle — translating capability into scope, cost into roadmap, and risk into UX.

Without a landscape map, teams default to:

whatever the last demo used,
whatever procurement already signed,
whatever scored highest on a benchmark that tests tasks none of your users perform.

A landscape map is not a vendor brochure. It is a taxonomy: who builds foundation models, how they are licensed, what tiers exist, and which dimensions actually move your product.

PM takeaway

Your job is to name the decision dimensions before the sprint picks a model ID from habit.

2. Foundation Model Families — The PM View

A family is a line of models from one builder with shared training philosophy, safety posture, and API shape. Model names change with nearly every release; families are the stable mental buckets.

Family	Builder lineage (PM shorthand)	Typical product association
GPT	OpenAI	Broad general assistant, tool-rich APIs, enterprise adoption
Claude	Anthropic	Long documents, careful tone, enterprise policy workflows
Gemini	Google	Workspace-adjacent products, multimodal in Google stack
Llama	Meta (open weights)	Self-host, customize, air-gapped or cost-controlled inference
Mistral	Mistral AI	Efficient open and hosted options, EU footprint considerations

Families overlap in capability. Differences show up in how you deploy: hosted API vs your VPC, default refusals, context packaging, tool schemas, and commercial terms — not in who “wins” a generic IQ contest.

PM takeaway

Learn families as deployment and policy buckets first; benchmark scores second.

3. Closed APIs vs Open-Weight Models

Closed API models run on the provider’s stack. You get fast iteration, managed safety layers, and provider-backed SLAs (typically on enterprise plans), in exchange for data-handling terms and per-token economics you do not control.

Open-weight models ship weights you can host. You gain deployment control, fine-tuning freedom, and potentially lower marginal cost at scale — with engineering ownership for inference, patching, evals, and guardrails.

Dimension	Closed API	Open-weight (self-hosted or third-party hosted)
Time to first prototype	Usually faster	Slower — infra required
Data residency	Contract + region choices	You define boundary
Customization	Prompting, tools, limited fine-tune	Fine-tune, distill, route locally
Cost at huge volume	Enterprise rates are negotiable; still billed per token	CapEx/OpEx tradeoff; unit cost can fall
Upgrade path	Provider deprecates models	You own migration testing
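
The "cost at huge volume" row can be made concrete with a rough break-even calculation. The sketch below is illustrative only: every price, the fixed monthly platform cost, and the function names are hypothetical placeholders, not vendor figures. It assumes API spend scales linearly with tokens, while self-hosting carries a fixed monthly cost plus a smaller marginal cost.

```python
from typing import Optional


def api_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Closed API: spend scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def self_host_monthly_cost(tokens_per_month: float,
                           fixed_cost: float,
                           marginal_per_million: float) -> float:
    """Self-hosting: fixed platform cost (GPUs, on-call, evals) plus marginal cost."""
    return fixed_cost + tokens_per_month / 1_000_000 * marginal_per_million


def break_even_tokens(price_per_million: float,
                      fixed_cost: float,
                      marginal_per_million: float) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper, or None if never."""
    saving_per_million = price_per_million - marginal_per_million
    if saving_per_million <= 0:
        return None  # self-hosting never catches up on unit cost
    return fixed_cost / saving_per_million * 1_000_000


# Hypothetical inputs: $5 per million tokens via API, $40,000 per month of
# platform cost, $1 per million tokens marginal cost when self-hosted.
volume = break_even_tokens(price_per_million=5.0,
                           fixed_cost=40_000.0,
                           marginal_per_million=1.0)
print(volume)  # 10000000000.0 tokens per month
```

Below that volume, the API bill is the cheaper line item under these assumptions; above it, the fixed platform cost starts to pay for itself.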

PM takeaway

Open-weight is not “free.” It shifts spend from API bills to platform teams, GPUs, and reliability work.

4. Frontier, Mid, and Small Tiers

Providers ship multiple sizes per family. PMs should think in tiers, not a single model name:

Frontier — hardest tasks, highest cost/latency, widest modality; use sparingly on critical paths.
Mid — default production tier for most user-facing reasoning with guardrails.
Small — classification, routing, extraction, high-volume chat turns; pairs well with cascade architectures.

Tier strategy is a product decision: route easy work to small models, escalate on uncertainty signals, and cap frontier usage with budgets and feature flags.
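
A minimal sketch of that routing policy, under stated assumptions: each model is a plain function returning an answer and a confidence score in [0, 1], and the threshold, budget, and flag are hypothetical values. How you obtain a confidence signal (a verifier model, heuristics, schema validation) is left open here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A model call returns (answer, confidence). This signature is an assumption
# made for illustration, not any provider's API.
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class TierRouter:
    small: ModelFn
    mid: ModelFn
    frontier: ModelFn
    escalate_below: float = 0.7    # hypothetical uncertainty threshold
    frontier_budget: int = 100     # max frontier calls per budget period
    frontier_enabled: bool = True  # stands in for a feature flag
    frontier_calls: int = 0

    def route(self, prompt: str) -> Tuple[str, str]:
        """Return (tier_used, answer), escalating only when confidence is low."""
        answer, confidence = self.small(prompt)
        if confidence >= self.escalate_below:
            return "small", answer

        answer, confidence = self.mid(prompt)
        if confidence >= self.escalate_below:
            return "mid", answer

        # Frontier is gated by both the flag and the remaining budget.
        if self.frontier_enabled and self.frontier_calls < self.frontier_budget:
            self.frontier_calls += 1
            answer, _ = self.frontier(prompt)
            return "frontier", answer

        # Budget exhausted or flag off: fall back to the mid-tier answer.
        return "mid", answer
```

The point of writing it down is that the threshold and the budget become explicit PRD parameters rather than values buried in a prompt-handling function.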

PM takeaway

Spec tier routing in the PRD — not “we’ll use the big model everywhere and optimize later.”

5. General vs Specialized Models

General models aim for broad language, reasoning, and tool use across domains. Specialized models target a modality or task: code completion, embeddings, rerankers, speech, OCR, medical/legal tuned variants.

When general wins	When specialized wins
Multi-step assistant with tools	Embedding search at billion-scale
Policy Q&A with RAG	On-device wake-word classifiers
Draft-and-edit copilots	Low-latency code completion in an IDE

Anti-pattern: forcing a general frontier model to do cheap perception work (e.g., simple layout OCR) because it is the only model ID in the repo.

PM takeaway

Map each workflow step to the cheapest competent model type — general or specialized.
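
One way to make "cheapest competent" operational is to record, per workflow step, each candidate's measured pass rate and cost, then pick the lowest-cost option that clears the quality bar. This is a sketch under assumptions: the candidate names, pass rates, and costs are hypothetical, and a real selection would also weigh latency and data terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical label, not a real model ID
    kind: str             # "general" or "specialized"
    pass_rate: float      # measured on your own eval set for this step
    cost_per_call: float  # dollars, at expected prompt sizes


def cheapest_competent(candidates: List[Candidate],
                       quality_bar: float) -> Optional[Candidate]:
    """Pick the lowest-cost candidate that clears the bar, or None if none does."""
    competent = [c for c in candidates if c.pass_rate >= quality_bar]
    return min(competent, key=lambda c: c.cost_per_call, default=None)


# Hypothetical candidates for a simple layout-OCR step.
ocr_step = [
    Candidate("general-frontier", "general", pass_rate=0.99, cost_per_call=0.05),
    Candidate("ocr-specialist", "specialized", pass_rate=0.97, cost_per_call=0.001),
    Candidate("tiny-classifier", "specialized", pass_rate=0.80, cost_per_call=0.0001),
]
choice = cheapest_competent(ocr_step, quality_bar=0.95)
print(choice.name if choice else "no candidate meets the bar")
```

A `None` result is useful information: it means the quality bar or the candidate list needs to change before the step ships.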

6. PM Comparison Table — Dimensions That Matter

Do not maintain a static “winner” column. Maintain fit against your constraints. Refresh cells when you run evals — not when a blog post drops.

Dimension	What to ask	Why PMs care
Task quality	Pass rate on your eval set?	Defines MVP bar and regression gates
Context	Enough window for real prompts + RAG?	Affects architecture from Chapter 9
Latency	p95 at target output length?	UX and agent loop cost
Cost	$ per successful outcome at projected volume?	Unit economics and tier routing
Safety / refusal	False refusals vs leaks on your policy?	Support load and compliance
Tool use	Reliable function calling for your schemas?	Agent features live or die here
Data terms	Training opt-out, retention, regions?	Legal and enterprise sales
Operability	Observability, versioning, fallbacks?	Incident response and A/B tests
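
Three of these cells (task quality, latency, cost) can be computed from the same eval run. The sketch below assumes a hypothetical `EvalRecord` shape with one row per eval case; p95 uses the nearest-rank method, and cost per successful outcome divides total spend, failed attempts included, by the number of passes.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    passed: bool      # did the output meet the acceptance criteria?
    latency_s: float  # end-to-end seconds for this case
    cost_usd: float   # spend for this case, including retries


def pass_rate(records: List[EvalRecord]) -> float:
    if not records:
        raise ValueError("no eval records")
    return sum(r.passed for r in records) / len(records)


def p95_latency(records: List[EvalRecord]) -> float:
    """Nearest-rank 95th percentile of latency."""
    if not records:
        raise ValueError("no eval records")
    ordered = sorted(r.latency_s for r in records)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cost_per_success(records: List[EvalRecord]) -> float:
    """Total spend divided by passes; infinite if nothing passed."""
    successes = sum(r.passed for r in records)
    if successes == 0:
        return math.inf
    return sum(r.cost_usd for r in records) / successes
```

Cost per success is the number to watch when comparing tiers: a cheaper model with a lower pass rate can cost more per successful outcome once failures are counted.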

PM takeaway

Publish this table inside your model selection memo — empty cells are risks you have not measured yet.

7. Competitive Landscape — Builders, Hosters, and Integrators

The market has three layers PMs should separate:

Foundation builders — train base models (OpenAI, Anthropic, Google, Meta, Mistral, others).
Hosters / clouds — sell inference endpoints, GPUs, and managed fine-tuning (hyperscalers, Together, Fireworks, etc.).
Integrators — frameworks and gateways (LangChain, LlamaIndex, LiteLLM, internal platforms) that abstract providers.

Your product rarely commits to only one layer. You might call Anthropic models through Amazon Bedrock, serve Llama on your own cluster, and log everything through an internal gateway. The PM's value lies in making those layers explicit in architecture diagrams and cost attribution.

PM takeaway

Ask “who owns the SLA for this call?” — builder, hoster, or your platform team.

8. The Wrong Question: “What Is the Best Model?”

Best for whom, on what task, under which constraints, measured how?

Public leaderboards answer benchmark authors’ tasks. Your users bring messy PDFs, acronyms, tool schemas, and compliance language. A model that tops a multiple-choice science suite may still fail your prior-auth letter workflow.

Replace “best” with:

Sufficient — meets quality bar with acceptable cost/latency.
Replaceable — abstraction layer so you can swap providers.
Observable — you can detect regressions when versions change.

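
"Replaceable" usually comes down to product code depending on one narrow interface rather than a vendor SDK. A minimal sketch, assuming a hypothetical `ChatModel` interface and two stub providers; real adapters would wrap actual SDK calls, which are deliberately not shown here.

```python
from typing import Dict, Protocol


class ChatModel(Protocol):
    """The only surface product code may depend on (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class VendorAStub:
    """Stand-in for an adapter around a hosted API."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class SelfHostedStub:
    """Stand-in for an adapter around a self-hosted open-weight model."""

    def complete(self, prompt: str) -> str:
        return f"[self-hosted] {prompt}"


# Swapping providers is a registry or config change, not a product code change.
REGISTRY: Dict[str, ChatModel] = {
    "vendor_a": VendorAStub(),
    "self_hosted": SelfHostedStub(),
}


def answer(prompt: str, provider: str) -> str:
    return REGISTRY[provider].complete(prompt)


print(answer("Summarize this policy.", provider="vendor_a"))
```

The interface is intentionally small. Every provider-specific feature you let leak through it is a feature you will have to re-test on the next swap.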
PM takeaway

Ban the word “best” in steering meetings. Require “sufficient for workflow X with evidence Y.”

9. Qualities by Use Case — Fit Beats Fame

Use case pattern	Prioritize	Deprioritize
High-stakes Q&A with citations	Grounding, refusal calibration, audit logs	Creative temperature
Agent with many tool steps	Function-call reliability, latency at medium context	Poetic prose quality
High-volume triage	Small-tier cost, consistent JSON	Frontier reasoning
Document-heavy review	Effective context + retrieval strategy	Single-shot “read everything” myths
Regulated self-host	Data boundary, patch process, eval ownership	Day-one benchmark rank

PM takeaway

Write use-case-specific acceptance criteria before comparing model names in a spreadsheet.

10. New Model Announcement Checklist

When marketing drops a new flagship name, run this checklist before rewriting the roadmap:

What tier does it replace — frontier, mid, or small?
Context window and modality — does it change RAG vs long-context assumptions?
Pricing model — input vs output vs cached tokens; any batch endpoints?
Safety delta — more refusals or fewer on your policy eval?
Breaking changes — deprecated endpoints, new tool schema, different JSON modes?
Your eval suite — schedule regression on top 5 production workflows, not a demo prompt.
Rollback — can you flip a feature flag to the prior model ID in one deploy?
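
The pricing item is easy to get wrong when input, cached, and output tokens are billed at different rates. The sketch below is one way to estimate per-request cost; the `Pricing` fields, the rates, and the flat batch discount are all hypothetical, and providers differ in how they bill cached tokens, so check the actual price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float         # dollars per million uncached input tokens
    cached_input_per_million: float  # dollars per million cached input tokens
    output_per_million: float        # dollars per million output tokens
    batch_discount: float = 0.0      # fraction off for batch endpoints, if any


def request_cost(pricing: Pricing,
                 input_tokens: int,
                 cached_tokens: int,
                 output_tokens: int,
                 batch: bool = False) -> float:
    """Estimated dollars for one request; cached_tokens is a subset of input_tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * pricing.input_per_million
        + cached_tokens * pricing.cached_input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    if batch:
        cost *= 1 - pricing.batch_discount
    return cost


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
rates = Pricing(input_per_million=2.0, cached_input_per_million=0.5,
                output_per_million=8.0, batch_discount=0.5)
print(request_cost(rates, input_tokens=1_000_000, cached_tokens=500_000,
                   output_tokens=250_000))  # 3.25
```

Running the same traffic profile through the old and new price sheets shows whether a launch changes your unit economics before any quality eval has finished.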

PM takeaway

Treat launches as migration projects with rollback — not as automatic upgrades.
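
A rollback-ready launch can be as simple as a flag that decides which model ID a user gets. The sketch below assumes a hypothetical `ModelRollout` config and placeholder model IDs; it buckets users deterministically by hashing their ID, so a given user stays on the same model until the percentage changes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ModelRollout:
    prior_model_id: str         # known-good, version-pinned ID
    new_model_id: str           # candidate from the new launch
    new_model_percent: int = 0  # 0 means fully on the prior model

    def model_for(self, user_id: str) -> str:
        """Deterministically assign a user to the prior or new model."""
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
        if bucket < self.new_model_percent:
            return self.new_model_id
        return self.prior_model_id

    def rollback(self) -> None:
        """Send all traffic back to the prior model."""
        self.new_model_percent = 0


# Placeholder IDs for illustration only.
rollout = ModelRollout(prior_model_id="model-v1-pinned",
                       new_model_id="model-v2-candidate",
                       new_model_percent=10)
print(rollout.model_for("user-123"))
rollout.rollback()
print(rollout.model_for("user-123"))  # model-v1-pinned
```

In practice the percentage would live in your feature-flag system rather than in code, so rollback is a config change instead of a deploy.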

11. Common PM Mistakes

Mistake	Why it hurts	Instead
Single-model mandate	No routing; runaway cost	Tier + cascade in architecture
Benchmark-driven roadmap	Optimizes tasks users never do	Product eval harness first
Ignoring data terms	Deal-blocker late in enterprise cycle	Legal review in selection memo
Confusing family with deployment	“We use Llama” but only via one vendor API	Document actual inference path
No version pin	Silent behavior drift in production	Pin model IDs; monitor regressions
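
Pinning only helps if something checks the candidate against the pinned baseline before traffic moves. A minimal regression gate might look like the sketch below; the workflow names, pass rates, and the two-point tolerance are hypothetical.

```python
from typing import Dict, List


def regression_gate(baseline: Dict[str, float],
                    candidate: Dict[str, float],
                    max_drop: float = 0.02) -> List[str]:
    """Return workflows where the candidate is missing or drops more than max_drop."""
    failing = []
    for workflow, base_rate in baseline.items():
        new_rate = candidate.get(workflow)
        if new_rate is None or base_rate - new_rate > max_drop:
            failing.append(workflow)
    return failing


# Hypothetical pass rates per production workflow.
pinned = {"triage": 0.95, "prior_auth_letter": 0.90}
upgrade = {"triage": 0.96, "prior_auth_letter": 0.85}

blocked = regression_gate(pinned, upgrade)
print(blocked)  # ['prior_auth_letter']
```

An empty list means the candidate may proceed to a staged rollout; anything else blocks the model ID change until someone reviews the failing workflow.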

Chapter Summary

Concept	PM understanding
Landscape map	Taxonomy of families, tiers, and deployment paths
Families	GPT, Claude, Gemini, Llama, Mistral as buckets — not religions
Closed vs open	Speed vs control; cost shifts to platform when self-hosting
Tiers	Frontier / mid / small for routing and economics
General vs specialized	Right model type per workflow step
“Best model”	Wrong question — sufficient, replaceable, observable
Announcements	Checklist + eval + rollback before hype upgrades

Closing Thought

The landscape will keep adding names. Your advantage is a stable decision language: constraints, evals, tiers, and deployment paths. Chapter 2 applies that language to a hands-on comparison of GPT, Claude, Gemini, and Llama on your own prompts.

The real PM lesson

Model choice is portfolio management — not picking a winner on social media.

