Chapter 02 · Module 02 · Beginner–Intermediate · 26–30 min

Chapter 2: Choosing Between GPT, Claude, Gemini, and Llama — The PM Version

Same prompts, same rubric — build a decision matrix you can defend in a model selection memo.

Book: AI Learning

Evidence beats hype — rerun the pack after every major release

Introduction

In Chapter 1, we built the map: families, tiers, closed vs open-weight, and why “best model” is the wrong question.

This chapter is the working session — how to compare GPT, Claude, Gemini, and Llama on your tasks without turning the exercise into brand loyalty or benchmark theater.

Same prompts, same rubric, same constraints — then decide.

The simple PM version

Five representative prompts → score with a rubric → decision matrix → memo-ready recommendation.

1. Why Side-by-Side Comparison Beats Hype

Sales demos cherry-pick prompts. Twitter threads hide failure modes. A PM comparison is boring on purpose: fixed inputs, blind scoring where possible, and notes on cost, latency, and policy fit.

You are not picking a mascot. You are choosing a default inference path for a workflow that will run thousands of times a day.

PM takeaway

Schedule comparison time before the eng spike — not after launch when switching is expensive.

2. Comparison Dimensions — Score What Moves Your KPI

| Dimension | How to measure (practical) |
| --- | --- |
| Task success | Pass/fail on rubric items; weighted by severity |
| Grounding | Cites provided context; no invented IDs |
| Format reliability | Valid JSON / schema on N runs |
| Tone & policy | Refusal when required; no unsafe shortcuts |
| Latency | p50/p95 for typical output length |
| Cost | (Tokens in + out) × current price sheet, per successful task |
| Operability | Tool calling, streaming, logging hooks available |
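
The latency and cost rows are simple arithmetic once you have a run log. The sketch below is one way to compute them, not a prescribed method. The record fields (`tokens_in`, `tokens_out`, `latency_ms`, `passed`), the per-1k-token prices, and the nearest-rank percentile definition are all assumptions made for illustration. Failed runs still count toward spend, which is what makes the figure "per successful task".

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_per_successful_task(runs, price_in_per_1k, price_out_per_1k):
    """Total spend over all runs (failures included) divided by the number of passing runs."""
    spend = sum(
        r["tokens_in"] / 1000 * price_in_per_1k + r["tokens_out"] / 1000 * price_out_per_1k
        for r in runs
    )
    passed = sum(1 for r in runs if r["passed"])
    # No passing runs means the cost per success is unbounded.
    return spend / passed if passed else float("inf")


# Illustrative run log; every number here is made up.
runs = [
    {"tokens_in": 1000, "tokens_out": 500, "latency_ms": 800, "passed": True},
    {"tokens_in": 1000, "tokens_out": 500, "latency_ms": 1200, "passed": False},
    {"tokens_in": 2000, "tokens_out": 1000, "latency_ms": 950, "passed": True},
    {"tokens_in": 1000, "tokens_out": 500, "latency_ms": 3000, "passed": True},
]
latencies = [r["latency_ms"] for r in runs]
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
print("cost per success:", cost_per_successful_task(runs, 1.0, 2.0))
```

With only a handful of runs the p95 is just the slowest sample, so treat it as a rough signal until you have a larger log.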

PM takeaway

Align dimension weights to workflow risk — medical triage weights grounding higher than marketing copy variants.

3. GPT Family — When It Fits

The GPT line from OpenAI is often the default general assistant in enterprises: strong tool ecosystems, wide third-party integrations, and familiar API patterns for engineers.

PM strengths to test: multi-tool agents, structured outputs, broad developer examples, enterprise procurement paths.

PM risks to test: overconfident answers without retrieval, policy drift across model versions, cost at frontier tier on long contexts.

Do not assume “GPT” means one behavior — specify the exact model ID and version in evals.
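
One lightweight way to enforce this is to make the eval configuration reject family names outright. The sketch below is an illustration only: `EvalTarget`, `VAGUE_LABELS`, and the example model ID are hypothetical names, not part of any vendor SDK, and the check is a simple heuristic rather than a lookup against a real model catalog.

```python
from dataclasses import dataclass

# Hypothetical list of labels that are too vague to reproduce an eval run.
VAGUE_LABELS = {"gpt", "claude", "gemini", "llama", "latest", "default"}


@dataclass(frozen=True)
class EvalTarget:
    vendor: str
    model_id: str        # exact identifier as sent to the API, including version
    temperature: float
    system_prompt: str

    def __post_init__(self):
        # Refuse family names and floating aliases so every result maps to one model build.
        if self.model_id.strip().lower() in VAGUE_LABELS:
            raise ValueError(
                f"'{self.model_id}' is a family name or alias, not a pinned model ID"
            )


# "example-model-2025-01-15" is a placeholder, not a real model ID.
target = EvalTarget("ExampleVendor", "example-model-2025-01-15", 0.0, "You are a support assistant.")
print(target.model_id)
```

The same applies to Claude, Gemini, and Llama candidates: record the identifier you actually called, next to the scores it produced.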

PM takeaway

Treat GPT as the default hypothesis your five prompts try to disprove, not as a choice to accept without evidence.

4. Claude Family — When It Fits

Anthropic’s Claude line is frequently chosen for long-document workflows, careful refusals, and teams already standardized on Anthropic safety docs.

PM strengths to test: nuanced policy language, large prompt + document bundles (with retrieval still required), structured editing tasks.

PM risks to test: false refusals on edge-case support queries, latency on very long outputs, integration gaps if your stack assumes OpenAI schemas.

PM takeaway

If your product is document-heavy, Claude belongs in the comparison set, but it should earn the slot on your prompts, not on reputation alone.

5. Gemini Family — When It Fits

Google’s Gemini line matters when your product already lives in Google Cloud / Workspace, or when multimodal features must align with Google media pipelines.

PM strengths to test: multimodal inputs in Google-centric stacks, batch flows on GCP, teams with existing Google contracts.

PM risks to test: cross-cloud complexity if the rest of the stack is AWS-only, inconsistent behavior across Gemini sizes, eval gaps on your domain jargon.

PM takeaway

Gemini is often a platform decision as much as a model decision — factor existing cloud commitments.

6. Llama Family — When It Fits

Meta’s Llama open-weight line is the usual starting point when you must self-host, fine-tune on permitted data, or run high-volume inference with hardware you control.

PM strengths to test: data residency, custom fine-tunes, cost curves at very high QPS after GPU amortization.

PM risks to test: safety tuning ownership, tool-use reliability vs closed APIs, engineering load for upgrades and eval pipelines.

PM takeaway

Choose Llama when control and economics at scale beat managed API convenience, and budget for the platform headcount that self-hosting requires.

7. Five-Prompt Comparison Exercise

Run the same five prompts on each candidate model ID with identical system instructions and temperature settings. Suggested set (adapt to your domain):

  1. Structured extraction — “Return JSON with fields X,Y,Z from this messy paragraph.”
  2. Grounded Q&A — answer only from a provided policy excerpt; include citation sentences.
  3. Tool-planning — given a fake API schema, produce a valid tool call sequence (no live side effects).
  4. Refusal edge case — request that should be declined or escalated per your policy.
  5. Long-context stress — realistic prompt size your feature will send (RAG + history), not a one-liner.
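
Some of these checks can be automated rather than scored by hand. As an example for prompt 1, the sketch below computes a JSON validity rate over repeated runs. It assumes each output is the raw string returned by the model and that "valid" means a JSON object containing every required field; the function name and the sample outputs are illustrative.

```python
import json


def json_validity_rate(outputs, required_fields):
    """Fraction of raw model outputs that parse as a JSON object containing every required field."""
    if not outputs:
        raise ValueError("need at least one output")
    valid = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable at all
        # A bare list or string is valid JSON but not the object the prompt asked for.
        if isinstance(parsed, dict) and all(field in parsed for field in required_fields):
            valid += 1
    return valid / len(outputs)


# Made-up outputs from four runs of the same extraction prompt.
outputs = [
    '{"X": 1, "Y": 2, "Z": 3}',
    '{"X": 1}',
    "Sure! Here is the JSON you asked for.",
    '{"X": 1, "Y": 2, "Z": 3, "extra": 0}',
]
print(json_validity_rate(outputs, ["X", "Y", "Z"]))
```

Whether extra fields should count as a failure depends on how strict your downstream parser is; this sketch tolerates them.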

Scoring: give each rubric row 0, 1, or 2 (fail / partial / pass), then multiply by the prompt's weight. Where nondeterminism matters, run each prompt at least 3 times (or with 3 seeds) and score every run.

| Prompt # | Weight (example) | Notes |
| --- | --- | --- |
| 1 Extraction | 15% | JSON validity rate |
| 2 Grounded Q&A | 30% | Hallucination = automatic fail |
| 3 Tool plan | 20% | Schema match |
| 4 Refusal | 15% | Policy alignment |
| 5 Long context | 20% | Uses provided material |
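
To make the arithmetic concrete, here is a minimal sketch of the weighted score using the example weights above. Two choices are assumptions rather than part of the method: scores from repeated runs are averaged per prompt, and the result is rescaled to 0–100. The "hallucination = automatic fail" note is modeled by zeroing the grounded Q&A score when any run hallucinated.

```python
# Example weights from the table above; they sum to 1.0.
WEIGHTS = {
    "extraction": 0.15,
    "grounded_qa": 0.30,
    "tool_plan": 0.20,
    "refusal": 0.15,
    "long_context": 0.20,
}


def weighted_quality(run_scores, hallucinated=False):
    """run_scores maps prompt name -> list of 0/1/2 scores, one per run. Returns a 0-100 score."""
    total = 0.0
    for prompt, weight in WEIGHTS.items():
        scores = run_scores[prompt]
        if not scores or any(s not in (0, 1, 2) for s in scores):
            raise ValueError(f"{prompt}: each run must be scored 0, 1, or 2")
        mean = sum(scores) / len(scores)
        if prompt == "grounded_qa" and hallucinated:
            mean = 0.0  # automatic fail for the whole prompt
        total += weight * (mean / 2)  # divide by 2 so a perfect prompt contributes its full weight
    return round(100 * total, 1)


# Made-up scores from three runs per prompt.
scores = {
    "extraction": [2, 2, 2],
    "grounded_qa": [2, 1, 2],
    "tool_plan": [1, 1, 2],
    "refusal": [2, 2, 2],
    "long_context": [2, 2, 1],
}
print(weighted_quality(scores))
print(weighted_quality(scores, hallucinated=True))
```

Averaging hides variance, so it is worth keeping the per-run scores alongside the summary number.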

PM takeaway

Store prompts and outputs in a shared folder — your model selection memo appendix should be reproducible.
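
What "reproducible" looks like in practice is up to your team. One possible shape, sketched below, is a JSON Lines file with one record per run. The field names and the `eval_record` helper are hypothetical; the prompt hash is included so that a silent edit to a prompt shows up when results are compared across reruns.

```python
import hashlib
import json


def eval_record(model_id, prompt_id, prompt_text, output_text, run_index, temperature):
    """Build one JSON Lines entry describing a single eval run."""
    record = {
        "model_id": model_id,          # exact ID and version, not the family name
        "prompt_id": prompt_id,
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "run_index": run_index,
        "temperature": temperature,
        "output": output_text,
    }
    return json.dumps(record, sort_keys=True)


# Placeholder values for illustration only.
line = eval_record(
    "example-model-2025-01-15", "p2_grounded_qa",
    "Answer only from the excerpt below.", "Refunds are allowed within 30 days.",
    run_index=0, temperature=0.0,
)
print(line)
```

Appending each line to a file in the shared folder gives the memo appendix a raw record that anyone can rescore.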

8. Decision Matrix — From Scores to Recommendation

After scoring, fill in a matrix like this (example structure; replace the blanks with scores from your own runs):

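| Candidate | Weighted quality | Est. cost / task | p95 latency | Data fit | Ops fit | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| GPT (tier: ___) | | | | | | |
| Claude (tier: ___) | | | | | | |
| Gemini (tier: ___) | | | | | | |
| Llama (host: ___) | | | | | | |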

Decision rules (examples):

  • If grounded Q&A fails → disqualify for that workflow regardless of other scores.
  • If cost / task is 3× higher with <5% quality gain → default to mid tier or cascade.
  • If data terms block enterprise use → remove from shortlist before quality debates.
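
Rules like these are easier to defend when they are written down as checks rather than applied by feel. The sketch below encodes the three example rules under stated assumptions: `Candidate` and its fields are hypothetical, the "<5% quality gain" is read as a relative gain over the cheaper option, and "3× higher" is read as at least three times the cost.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    weighted_quality: float    # 0-100 from the rubric
    cost_per_task: float       # same currency unit for every candidate
    grounded_qa_passed: bool
    data_terms_ok: bool


def shortlist(candidates):
    """Hard gates: data terms and grounded Q&A disqualify regardless of other scores."""
    return [c for c in candidates if c.data_terms_ok and c.grounded_qa_passed]


def prefer_cheaper(premium, cheaper, cost_ratio=3.0, min_gain=0.05):
    """True when the premium option costs >= cost_ratio times more for < min_gain relative quality."""
    if cheaper.weighted_quality <= 0:
        return False  # nothing to gain by defaulting to a model that scores zero
    gain = (premium.weighted_quality - cheaper.weighted_quality) / cheaper.weighted_quality
    return premium.cost_per_task >= cost_ratio * cheaper.cost_per_task and gain < min_gain


# Made-up candidates; costs are in arbitrary units.
frontier = Candidate("frontier-tier", 88.0, 12.0, True, True)
mid = Candidate("mid-tier", 85.0, 3.0, True, True)
blocked = Candidate("blocked-by-terms", 99.0, 1.0, True, False)

print([c.name for c in shortlist([frontier, mid, blocked])])
print(prefer_cheaper(frontier, mid))
```

The thresholds are examples; what matters is that they are fixed before the scores come in.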

PM takeaway

End with a primary + fallback model ID and explicit triggers to escalate tier or switch vendor.

9. You Rarely Pick Just One — Routing and Cascade

Production systems often use a small model to classify or draft, then escalate to a larger model on low confidence. Your comparison should note which steps can be down-tiered without hurting the user-visible outcome.

| Step | Candidate tier | Escalation trigger |
| --- | --- | --- |
| Intent routing | Small | Low confidence or policy flag |
| Draft answer | Mid | Eval score < threshold |
| Appeals / edge case | Frontier | Human queue or explicit user opt-in |
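
The first two rows of the table can be read as control flow. The sketch below is a simplified illustration, not a production router: the four callables stand in for real model calls, the thresholds are arbitrary, and it assumes any escalation goes straight to the frontier tier. The human queue and user opt-in triggers from the third row are not modeled.

```python
def cascade(request, classify, draft, judge, frontier, min_confidence=0.8, min_score=0.7):
    """Route one request through small -> mid -> frontier and return (tier_used, answer)."""
    confidence, policy_flag = classify(request)    # small model: intent routing
    if confidence < min_confidence or policy_flag:
        return "frontier", frontier(request)
    answer = draft(request)                        # mid model: draft answer
    if judge(answer) < min_score:                  # automated eval score for the draft
        return "frontier", frontier(request)
    return "mid", answer


# Stub callables so the flow can be exercised without any API.
tier, answer = cascade(
    "Where is my order?",
    classify=lambda r: (0.95, False),
    draft=lambda r: "Your order shipped yesterday.",
    judge=lambda a: 0.9,
    frontier=lambda r: "Escalated answer.",
)
print(tier, answer)
```

Logging `tier_used` per request shows how often the cheaper tiers are sufficient, which feeds directly into the cost column of the decision matrix.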

10. Common PM Mistakes

| Mistake | Why it fails | Fix |
| --- | --- | --- |
| Different prompts per vendor | Optimizes for charisma, not capability | Identical prompt pack |
| One-run conclusions | Sampling noise misleads | Multiple runs + rubric |
| Ignoring cost at scale | Wins demo, loses unit economics | Cost per successful outcome |
| No fallback model | Outage = feature off | Secondary ID + degraded mode UX |
| Comparing unrelated tiers | Apples to oranges | Match tier intent (mid vs mid) |

Chapter Summary

| Step | Output |
| --- | --- |
| Define dimensions | Weighted rubric tied to risk |
| Five-prompt pack | Reproducible eval appendix |
| Decision matrix | Primary + fallback + escalation rules |
| Cascade map | Where small/mid/frontier earn their keep |

Closing Thought

Vendors will keep publishing comparisons that favor their house model. Your five-prompt pack is the antidote — evidence your team can re-run after the next release.

Chapter 3 covers multimodal inputs: when vision, audio, code, and document models belong in the path — and when simpler OCR or specialized models are enough.

The real PM lesson

Defensible model choice is a repeatable experiment — not a one-time committee vote.
