Introduction
In Chapter 1, we built the map: families, tiers, closed vs open-weight, and why “best model” is the wrong question.
This chapter is the working session — how to compare GPT, Claude, Gemini, and Llama on your tasks without turning the exercise into brand loyalty or benchmark theater.
Same prompts, same rubric, same constraints — then decide.
The simple PM version
Five representative prompts → score with a rubric → decision matrix → memo-ready recommendation.
1. Why Side-by-Side Comparison Beats Hype
Sales demos cherry-pick prompts. Twitter threads hide failure modes. A PM comparison is boring on purpose: fixed inputs, blind scoring where possible, and notes on cost, latency, and policy fit.
You are not picking a mascot. You are choosing a default inference path for a workflow that will run thousands of times a day.
PM takeaway
Schedule comparison time before the eng spike — not after launch when switching is expensive.
2. Comparison Dimensions — Score What Moves Your KPI
| Dimension | How to measure (practical) |
|---|---|
| Task success | Pass/fail on rubric items; weighted by severity |
| Grounding | Cites provided context; no invented IDs |
| Format reliability | Valid JSON / schema on N runs |
| Tone & policy | Refusal when required; no unsafe shortcuts |
| Latency | p50/p95 for typical output length |
| Cost | Total token cost (input + output, at current price-sheet rates) divided by the number of successful tasks |
| Operability | Tool calling, streaming, logging hooks available |
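As a rough illustration of the latency row, p50 and p95 can be computed from per-request timings with Python's standard library. This is a minimal sketch: the function name `latency_summary` and the choice of the "inclusive" interpolation method are assumptions, not something the chapter prescribes.

```python
import statistics

def latency_summary(samples_ms):
    """Return p50 and p95 from a list of per-request latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Collect the samples at the output length your feature will really produce. Timings from short demo answers understate p95.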
PM takeaway
Align dimension weights to workflow risk — medical triage weights grounding higher than marketing copy variants.
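The cost row above can be sketched as a small calculation. The function name, the per-1,000-token pricing unit, and the idea that a failed call costs as much as a successful one are all assumptions for illustration. Take real prices from the vendor's current price sheet.

```python
def cost_per_successful_task(input_tokens, output_tokens,
                             price_in_per_1k, price_out_per_1k,
                             success_rate):
    """Average spend needed to obtain one task that passes the rubric."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Cost of a single call, successful or not.
    cost_per_call = ((input_tokens / 1000) * price_in_per_1k
                     + (output_tokens / 1000) * price_out_per_1k)
    # Failed calls still cost money, so spread their cost over the successes.
    return cost_per_call / success_rate
```

With 2,000 input tokens, 500 output tokens, placeholder prices of 0.01 and 0.03 per 1,000 tokens, and an 80% success rate, this gives about 0.044 per successful task, against 0.035 per raw call.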
3. GPT Family — When It Fits
The GPT line from OpenAI is often the default general assistant in enterprises: strong tool ecosystems, wide third-party integrations, and familiar API patterns for engineers.
PM strengths to test: multi-tool agents, structured outputs, broad developer examples, enterprise procurement paths.
PM risks to test: overconfident answers without retrieval, policy drift across model versions, cost at frontier tier on long contexts.
Do not assume “GPT” means one behavior — specify the exact model ID and version in evals.
PM takeaway
GPT is a strong default to disprove with your five prompts — not to accept without evidence.
4. Claude Family — When It Fits
Anthropic’s Claude line is frequently chosen for long-document workflows, careful refusals, and teams already standardized on Anthropic safety docs.
PM strengths to test: nuanced policy language, large prompt + document bundles (with retrieval still required), structured editing tasks.
PM risks to test: false refusals on edge-case support queries, latency on very long outputs, integration gaps if your stack assumes OpenAI schemas.
PM takeaway
If your product is document-heavy, Claude belongs in the comparison set, but it should earn its place on your prompts, not on reputation alone.
5. Gemini Family — When It Fits
Google’s Gemini line matters when your product already lives in Google Cloud / Workspace, or when multimodal features must align with Google media pipelines.
PM strengths to test: multimodal inputs in Google-centric stacks, batch flows on GCP, teams with existing Google contracts.
PM risks to test: cross-cloud complexity if the rest of the stack is AWS-only, inconsistent behavior across Gemini sizes, eval gaps on your domain jargon.
PM takeaway
Gemini is often a platform decision as much as a model decision — factor existing cloud commitments.
6. Llama Family — When It Fits
Meta’s Llama open-weight line is the usual starting point when you must self-host, fine-tune on permitted data, or run high-volume inference with hardware you control.
PM strengths to test: data residency, custom fine-tunes, cost curves at very high QPS after GPU amortization.
PM risks to test: safety tuning ownership, tool-use reliability vs closed APIs, engineering load for upgrades and eval pipelines.
PM takeaway
Choose Llama when control and economics at scale beat managed API convenience — and budget platform headcount.
7. Five-Prompt Comparison Exercise
Run the same five prompts on each candidate model ID with identical system instructions and temperature settings. Suggested set (adapt to your domain):
- Structured extraction — “Return JSON with fields X,Y,Z from this messy paragraph.”
- Grounded Q&A — answer only from a provided policy excerpt; include citation sentences.
- Tool-planning — given a fake API schema, produce a valid tool call sequence (no live side effects).
- Refusal edge case — request that should be declined or escalated per your policy.
- Long-context stress — realistic prompt size your feature will send (RAG + history), not a one-liner.
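For the structured extraction prompt, format reliability can be measured as the share of runs that return usable JSON. The sketch below assumes each run's raw output has been saved as a string. The function name and the field names in the usage note are hypothetical.

```python
import json

def json_validity_rate(outputs, required_fields):
    """Fraction of raw outputs that parse as a JSON object containing every required field."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable: counts as a format failure
        if isinstance(parsed, dict) and all(f in parsed for f in required_fields):
            valid += 1
    return valid / len(outputs)
```

For example, `json_validity_rate(saved_outputs, ["X", "Y", "Z"])` gives one number per model ID that can go straight into the rubric notes.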
Scoring: 0–2 per rubric row (fail / partial / pass). Multiply each prompt's score by its weight and sum the results. Where sampling is nondeterministic, run each prompt at least 3 times (or with 3 seeds) and score every run.
| Prompt # | Weight (example) | Notes |
|---|---|---|
| 1 Extraction | 15% | JSON validity rate |
| 2 Grounded Q&A | 30% | Hallucination = automatic fail |
| 3 Tool plan | 20% | Schema match |
| 4 Refusal | 15% | Policy alignment |
| 5 Long context | 20% | Uses provided material |
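The weighting in the table above can be turned into a single quality number per model. This is a sketch under stated assumptions: scores are averaged across runs, normalized to a 0 to 1 scale, and "automatic fail" is read as zeroing the grounded Q&A prompt. The dictionary keys are hypothetical labels for the five prompts.

```python
# Example weights copied from the table; replace with your own.
WEIGHTS = {
    "extraction": 0.15,
    "grounded_qa": 0.30,
    "tool_plan": 0.20,
    "refusal": 0.15,
    "long_context": 0.20,
}
MAX_SCORE = 2  # rubric scale: 0 = fail, 1 = partial, 2 = pass

def weighted_quality(scores, hallucinated=False):
    """scores maps prompt name to a list of 0-2 scores, one per run. Returns 0..1."""
    total = 0.0
    for prompt, weight in WEIGHTS.items():
        runs = scores[prompt]
        mean = sum(runs) / len(runs)
        # Assumption: any hallucination zeroes the grounded Q&A prompt.
        if prompt == "grounded_qa" and hallucinated:
            mean = 0.0
        total += weight * (mean / MAX_SCORE)
    return total
```

A model that passes every prompt scores 1.0. The same model with a hallucination on grounded Q&A drops to 0.70, which makes the cost of that failure visible in the matrix.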
PM takeaway
Store prompts and outputs in a shared folder — your model selection memo appendix should be reproducible.
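One way to keep the appendix reproducible is to record every run in a fixed shape and save it as JSON Lines. The sketch below is illustrative only: the `RunRecord` fields and helper names are assumptions, and it calls no vendor API. Fill in `output` from whatever client your engineers use.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunRecord:
    model_id: str        # exact model ID and version, not just the family name
    prompt_id: str
    system_prompt: str   # identical across all candidates
    temperature: float   # identical across all candidates
    run_index: int
    output: str

def build_run_plan(model_ids, prompt_ids, runs_per_prompt=3):
    """Every model sees every prompt the same number of times."""
    return [(m, p, r)
            for m in model_ids
            for p in prompt_ids
            for r in range(runs_per_prompt)]

def to_jsonl(records):
    """Serialize records as one JSON object per line for the shared folder."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Four candidates, five prompts, and three runs each give 60 records, a small enough set to re-run after each model release.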
8. Decision Matrix — From Scores to Recommendation
After scoring, fill a matrix like this (example structure — replace scores with your run):
| Candidate | Weighted quality | Est. cost / task | p95 latency | Data fit | Ops fit | Notes |
|---|---|---|---|---|---|---|
| GPT (tier: ___) | — | — | — | — | — | |
| Claude (tier: ___) | — | — | — | — | — | |
| Gemini (tier: ___) | — | — | — | — | — | |
| Llama (host: ___) | — | — | — | — | — | |
Decision rules (examples):
- If grounded Q&A fails → disqualify for that workflow regardless of other scores.
- If cost / task is 3× higher with <5% quality gain → default to mid tier or cascade.
- If data terms block enterprise → remove from shortlist before quality debates.
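The example rules above can be written as a simple filter so the memo applies them the same way to every candidate. This is a sketch, not a prescribed method. The dictionary keys are hypothetical, quality is assumed to be on a 0 to 1 scale, and the "<5% quality gain" rule is read here as an absolute difference of 0.05 against a chosen baseline.

```python
def shortlist(candidates, baseline):
    """candidates maps name to a dict with quality, cost_per_task,
    grounded_qa_pass, and data_terms_ok. Returns name -> decision string."""
    base = candidates[baseline]
    decisions = {}
    for name, c in candidates.items():
        if not c["data_terms_ok"]:
            # Checked first: no quality debate for blocked vendors.
            decisions[name] = "removed: data terms"
        elif not c["grounded_qa_pass"]:
            decisions[name] = "disqualified: grounded Q&A"
        elif (name != baseline
              and c["cost_per_task"] >= 3 * base["cost_per_task"]
              and c["quality"] - base["quality"] < 0.05):
            decisions[name] = "deprioritized: cost not justified"
        else:
            decisions[name] = "keep"
    return decisions
```

The sketch does not handle the case where the baseline itself is disqualified. In that situation, pick a new baseline and re-run the filter.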
PM takeaway
End with a primary + fallback model ID and explicit triggers to escalate tier or switch vendor.
9. You Rarely Pick Just One — Routing and Cascade
Production systems often use a small model to classify or draft, then escalate to a larger model on low confidence. Your comparison should note which steps can be down-tiered without hurting the user-visible outcome.
| Step | Candidate tier | Escalation trigger |
|---|---|---|
| Intent routing | Small | Low confidence or policy flag |
| Draft answer | Mid | Eval score < threshold |
| Appeals / edge case | Frontier | Human queue or explicit user opt-in |
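The escalation logic in the table can be sketched as a short routing function. It is a simplification: the three model calls are passed in as plain functions (hypothetical stand-ins for real clients), both thresholds are placeholder values, and every escalation goes straight to the frontier tier rather than to a human queue.

```python
from typing import Callable, Tuple

def cascade(query: str,
            classify: Callable[[str], Tuple[str, float, bool]],
            draft: Callable[[str], Tuple[str, float]],
            escalate: Callable[[str], str],
            confidence_floor: float = 0.7,
            eval_floor: float = 0.8) -> Tuple[str, str]:
    """Return (tier_used, answer) for one query."""
    # Small tier: intent routing with a confidence and a policy flag.
    _intent, confidence, policy_flag = classify(query)
    if confidence < confidence_floor or policy_flag:
        return "frontier", escalate(query)
    # Mid tier: draft the answer and score it with an automatic eval.
    answer, eval_score = draft(query)
    if eval_score < eval_floor:
        return "frontier", escalate(query)
    return "mid", answer
```

Logging the returned tier for every request shows how often the frontier model is actually needed, which feeds back into the cost column of the decision matrix.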
10. Common PM Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Different prompts per vendor | Measures prompt tuning, not model capability | Identical prompt pack |
| One-run conclusions | Sampling noise misleads | Multiple runs + rubric |
| Ignoring cost at scale | Wins demo, loses unit economics | Cost per successful outcome |
| No fallback model | Outage = feature off | Secondary ID + degraded mode UX |
| Comparing unrelated tiers | Apples to oranges | Match tier intent (mid vs mid) |
Chapter Summary
| Step | Output |
|---|---|
| Define dimensions | Weighted rubric tied to risk |
| Five-prompt pack | Reproducible eval appendix |
| Decision matrix | Primary + fallback + escalation rules |
| Cascade map | Where small/mid/frontier earn their keep |
Closing Thought
Vendors will keep publishing comparisons that favor their house model. Your five-prompt pack is the antidote — evidence your team can re-run after the next release.
Chapter 3 covers multimodal inputs: when vision, audio, code, and document models belong in the path — and when simpler OCR or specialized models are enough.
The real PM lesson
Defensible model choice is a repeatable experiment — not a one-time committee vote.