Chapter 2: Choosing Between GPT, Claude, Gemini, and Llama

Introduction

In Chapter 1, we built the map: families, tiers, closed vs open-weight, and why “best model” is the wrong question.

This chapter is the working session — how to compare GPT, Claude, Gemini, and Llama on your tasks without turning the exercise into brand loyalty or benchmark theater.

Same prompts, same rubric, same constraints — then decide.

The simple PM version

Five representative prompts → score with a rubric → decision matrix → memo-ready recommendation.

1. Why Side-by-Side Comparison Beats Hype

Sales demos cherry-pick prompts. Twitter threads hide failure modes. A PM comparison is boring on purpose: fixed inputs, blind scoring where possible, and notes on cost, latency, and policy fit.

You are not picking a mascot. You are choosing a default inference path for a workflow that will run thousands of times a day.

PM takeaway

Schedule comparison time before the eng spike — not after launch when switching is expensive.

2. Comparison Dimensions — Score What Moves Your KPI

Dimension	How to measure (practical)
Task success	Pass/fail on rubric items; weighted by severity
Grounding	Cites provided context; no invented IDs
Format reliability	Valid JSON / schema on N runs
Tone & policy	Refusal when required; no unsafe shortcuts
Latency	p50/p95 for typical output length
Cost	(Input tokens × input price + output tokens × output price) from the current price sheet, divided by the number of successful tasks
Operability	Tool calling, streaming, logging hooks available
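The cost and latency rows above can be turned into numbers with a few lines of code. The sketch below is illustrative only: the run-record fields (`tokens_in`, `tokens_out`, `success`) and the per-1,000-token prices are hypothetical placeholders, and the percentile uses a simple nearest-rank method that may differ slightly from what your monitoring tool reports.

```python
def cost_per_successful_task(runs, price_in_per_1k, price_out_per_1k):
    """Total spend across ALL runs divided by the number of successful runs.

    Failed runs still cost money, so they stay in the numerator.
    Prices are hypothetical; read them from the vendor's current price sheet.
    """
    total = sum(
        r["tokens_in"] / 1000 * price_in_per_1k
        + r["tokens_out"] / 1000 * price_out_per_1k
        for r in runs
    )
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return float("inf")  # nothing succeeded, so cost per success is unbounded
    return total / successes


def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile; pct is an integer such as 50 or 95."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    sample_runs = [
        {"tokens_in": 1000, "tokens_out": 500, "success": True},
        {"tokens_in": 1000, "tokens_out": 500, "success": False},
    ]
    print(cost_per_successful_task(sample_runs, 1.0, 2.0))
    print(latency_percentile([100, 200, 300, 400], 95))
```

Dividing by successful tasks rather than by all tasks is the point: a cheap model that fails half the time can cost more per usable answer than a pricier one that rarely fails.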

PM takeaway

Align dimension weights to workflow risk: a medical triage workflow should weight grounding far more heavily than a generator of marketing copy variants.

3. GPT Family — When It Fits

The GPT line from OpenAI is often the default general assistant in enterprises: strong tool ecosystems, wide third-party integrations, and familiar API patterns for engineers.

PM strengths to test: multi-tool agents, structured outputs, broad developer examples, enterprise procurement paths.

PM risks to test: overconfident answers without retrieval, policy drift across model versions, cost at frontier tier on long contexts.

Do not assume “GPT” means one behavior — specify the exact model ID and version in evals.

PM takeaway

Treat GPT as a strong default that your five prompts should try to disprove, not one to accept without evidence.

4. Claude Family — When It Fits

Anthropic’s Claude line is frequently chosen for long-document workflows, careful refusals, and by teams that have already standardized on Anthropic’s safety documentation.

PM strengths to test: nuanced policy language, large prompt + document bundles (with retrieval still required), structured editing tasks.

PM risks to test: false refusals on edge-case support queries, latency on very long outputs, integration gaps if your stack assumes OpenAI schemas.

PM takeaway

If your product is document-heavy, Claude belongs in the comparison set, but it should earn its place on your results, not on reputation alone.

5. Gemini Family — When It Fits

Google’s Gemini line matters when your product already lives in Google Cloud / Workspace, or when multimodal features must align with Google media pipelines.

PM strengths to test: multimodal inputs in Google-centric stacks, batch flows on GCP, teams with existing Google contracts.

PM risks to test: cross-cloud complexity if the rest of the stack is AWS-only, inconsistent behavior across Gemini sizes, eval gaps on your domain jargon.

PM takeaway

Gemini is often a platform decision as much as a model decision — factor existing cloud commitments.

6. Llama Family — When It Fits

Meta’s Llama open-weight line is the usual starting point when you must self-host, fine-tune on permitted data, or run high-volume inference with hardware you control.

PM strengths to test: data residency, custom fine-tunes, cost curves at very high QPS after GPU amortization.

PM risks to test: safety tuning ownership, tool-use reliability vs closed APIs, engineering load for upgrades and eval pipelines.

PM takeaway

Choose Llama when control and economics at scale beat managed API convenience, and budget for the platform headcount that self-hosting requires.

7. Five-Prompt Comparison Exercise

Run the same five prompts on each candidate model ID with identical system instructions and temperature settings. Suggested set (adapt to your domain):

Structured extraction — “Return JSON with fields X,Y,Z from this messy paragraph.”
Grounded Q&A — answer only from a provided policy excerpt; include citation sentences.
Tool-planning — given a fake API schema, produce a valid tool call sequence (no live side effects).
Refusal edge case — request that should be declined or escalated per your policy.
Long-context stress — realistic prompt size your feature will send (RAG + history), not a one-liner.
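For the structured extraction prompt, "JSON validity rate" can be measured mechanically. The following is a minimal sketch, assuming a hypothetical `call_model` function that takes a prompt string and returns the model's raw text. It is a stand-in for whichever vendor SDK you wrap, not a real API.

```python
import json
from typing import Callable, Set


def is_valid_extraction(raw: str, required_fields: Set[str]) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_fields.issubset(data)


def json_validity_rate(
    call_model: Callable[[str], str],  # hypothetical wrapper around one model ID
    prompt: str,
    required_fields: Set[str],
    n_runs: int = 5,
) -> float:
    """Send the same prompt n_runs times and return the share of valid outputs."""
    passes = sum(
        is_valid_extraction(call_model(prompt), required_fields)
        for _ in range(n_runs)
    )
    return passes / n_runs


if __name__ == "__main__":
    # Canned outputs stand in for a real model so the sketch runs offline.
    canned = iter(['{"X": 1, "Y": 2, "Z": 3}', "Sure! Here is the JSON:", '{"X": 1}'])
    rate = json_validity_rate(lambda p: next(canned), "extract X, Y, Z", {"X", "Y", "Z"}, n_runs=3)
    print(f"validity rate: {rate:.2f}")
```

Passing the model in as a function keeps the check identical across vendors: only the wrapper changes per candidate, never the prompt or the validation.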

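Scoring: 0–2 per rubric row (fail / partial / pass). Multiply each prompt's score by its weight and sum the results into one weighted quality score. Wherever nondeterminism matters, run each prompt at least three times (or with three different seeds) and average the scores.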

Prompt #	Weight (example)	Notes
1 Extraction	15%	JSON validity rate
2 Grounded Q&A	30%	Hallucination = automatic fail
3 Tool plan	20%	Schema match
4 Refusal	15%	Policy alignment
5 Long context	20%	Uses provided material
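As a rough illustration of the arithmetic, the sketch below averages each prompt's 0–2 scores across runs, normalizes to a 0–1 range, applies the example weights from the table, and reports a 0–100 weighted quality score. The dictionary keys and sample scores are made up for the example. The "hallucination = automatic fail" rule is assumed to be applied by the grader, who records 0 for that run.

```python
# Example weights from the table above; they must sum to 1.0.
WEIGHTS = {
    "extraction": 0.15,
    "grounded_qa": 0.30,
    "tool_plan": 0.20,
    "refusal": 0.15,
    "long_context": 0.20,
}
MAX_SCORE = 2  # rubric scale: 0 = fail, 1 = partial, 2 = pass


def weighted_quality(run_scores):
    """run_scores maps each prompt name to a list of 0-2 scores, one per run."""
    total = 0.0
    for prompt, weight in WEIGHTS.items():
        scores = run_scores[prompt]
        mean = sum(scores) / len(scores)    # average across runs or seeds
        total += weight * mean / MAX_SCORE  # normalize to 0..1, then weight
    return round(100 * total, 1)


if __name__ == "__main__":
    # Hypothetical scores for one candidate model ID over three runs.
    sample = {
        "extraction": [2, 2, 1],
        "grounded_qa": [2, 0, 2],  # the 0 is a hallucinated answer
        "tool_plan": [1, 1, 1],
        "refusal": [2, 2, 2],
        "long_context": [2, 1, 1],
    }
    print(weighted_quality(sample))  # 70.8
```

Because grounded Q&A carries the largest weight, the single hallucinated run costs this hypothetical candidate ten points on its own.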

PM takeaway

Store prompts and outputs in a shared folder — your model selection memo appendix should be reproducible.

8. Decision Matrix — From Scores to Recommendation

After scoring, fill in a matrix like this (example structure only; replace the placeholders with results from your own runs):

Candidate	Weighted quality	Est. cost / task	p95 latency	Data fit	Ops fit
GPT (tier: ___)	—	—	—	—	—
Claude (tier: ___)	—	—	—	—	—
Gemini (tier: ___)	—	—	—	—	—
Llama (host: ___)	—	—	—	—	—

Decision rules (examples):

If grounded Q&A fails → disqualify for that workflow regardless of other scores.
If cost / task is 3× higher with <5% quality gain → default to mid tier or cascade.
If a vendor's data terms block enterprise use → remove it from the shortlist before any quality debate.
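Writing the rules down as code makes them harder to bend after the scores come in. The sketch below is one possible encoding, not a prescribed method. The `Candidate` fields and model IDs are hypothetical, and the "<5% quality gain" rule is interpreted here as a relative gain in weighted quality, which is an assumption you should replace with your own definition.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model_id: str              # hypothetical ID; use the exact vendor model ID
    weighted_quality: float    # 0..100 from the five-prompt rubric
    cost_per_task: float       # cost per successful task, must be > 0
    grounded_qa_passed: bool   # hard gate for this workflow
    data_terms_ok: bool        # legal/procurement gate


def shortlist(candidates):
    """Apply the disqualifying rules before any quality comparison."""
    return [c for c in candidates if c.data_terms_ok and c.grounded_qa_passed]


def pick_default(expensive, cheaper, cost_ratio=3.0, min_gain=0.05):
    """Prefer the cheaper model when the pricier one costs at least cost_ratio
    times more but improves weighted quality by less than min_gain (relative)."""
    ratio = expensive.cost_per_task / cheaper.cost_per_task
    gain = (expensive.weighted_quality - cheaper.weighted_quality) / cheaper.weighted_quality
    if ratio >= cost_ratio and gain < min_gain:
        return cheaper
    return expensive


if __name__ == "__main__":
    frontier = Candidate("vendor-a-frontier", 82.0, 0.030, True, True)
    mid = Candidate("vendor-a-mid", 80.0, 0.008, True, True)
    blocked = Candidate("vendor-b-mid", 90.0, 0.005, True, False)
    remaining = shortlist([frontier, mid, blocked])
    print([c.model_id for c in remaining])
    print(pick_default(frontier, mid).model_id)
```

Note the ordering: the gates run first, so a high-scoring candidate with blocked data terms never reaches the cost and quality comparison.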

PM takeaway

End with a primary + fallback model ID and explicit triggers to escalate tier or switch vendor.

9. You Rarely Pick Just One — Routing and Cascade

Production systems often use a small model to classify or draft, then escalate to a larger model on low confidence. Your comparison should note which steps can be down-tiered without hurting the user-visible outcome.

Step	Candidate tier	Escalation trigger
Intent routing	Small	Low confidence or policy flag
Draft answer	Mid	Eval score < threshold
Appeals / edge case	Frontier	Human queue or explicit user opt-in
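The escalation logic in the table can be sketched as a simple loop. This is a simplified illustration under stated assumptions: each tier is a hypothetical callable that returns an answer plus a confidence score between 0 and 1, and a single threshold stands in for the per-step triggers. Real systems usually derive confidence from a classifier or an eval score rather than from the model itself.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt in, (answer, confidence 0..1) out.
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: List[Tuple[str, Model]], threshold: float = 0.8) -> dict:
    """Try tiers from cheapest to most capable and stop at the first confident answer.

    If no tier clears the threshold, return the last tier's answer flagged
    for the human queue.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    for name, model in tiers:
        answer, confidence = model(prompt)
        if confidence >= threshold:
            return {"tier": name, "answer": answer, "needs_human": False}
    return {"tier": name, "answer": answer, "needs_human": True}


if __name__ == "__main__":
    # Stub models stand in for real small / mid / frontier endpoints.
    tiers = [
        ("small", lambda p: ("rough draft", 0.4)),
        ("mid", lambda p: ("solid answer", 0.9)),
        ("frontier", lambda p: ("best answer", 0.95)),
    ]
    print(cascade("How do I reset my password?", tiers))
```

In this stubbed run the request stops at the mid tier, so the frontier model is never called. That avoided call is where a cascade saves money.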

10. Common PM Mistakes

Mistake	Why it fails	Fix
Different prompts per vendor	Optimizes for charisma, not capability	Identical prompt pack
One-run conclusions	Sampling noise misleads	Multiple runs + rubric
Ignoring cost at scale	Wins demo, loses unit economics	Cost per successful outcome
No fallback model	Outage = feature off	Secondary ID + degraded mode UX
Comparing unrelated tiers	Apples to oranges	Match tier intent (mid vs mid)

Chapter Summary

Step	Output
Define dimensions	Weighted rubric tied to risk
Five-prompt pack	Reproducible eval appendix
Decision matrix	Primary + fallback + escalation rules
Cascade map	Where small/mid/frontier earn their keep

Closing Thought

Vendors will keep publishing comparisons that favor their house model. Your five-prompt pack is the antidote — evidence your team can re-run after the next release.

Chapter 3 covers multimodal inputs: when vision, audio, code, and document models belong in the path — and when simpler OCR or specialized models are enough.

The real PM lesson

Defensible model choice is a repeatable experiment — not a one-time committee vote.
