Introduction
In Chapter 2, we compared model families — GPT, Claude, Gemini, Llama — and how to pick a provider for a use case.
Most real products are not text-only. Users upload PDFs, photos of forms, call recordings, screenshots, spreadsheets, and code repos. “Multimodal” is the label vendors use when a model can accept more than plain text in a single request.
That sounds simple. In production it is not. The same model that summarizes a clean PDF may misread a stamped invoice, skip a chart legend, or confidently describe an image detail that is not there.
This chapter is the PM map: what each modality is good for, where it breaks, how workflows should look, and when a dedicated OCR or speech pipeline beats “just send it to the vision model.”
The simple PM version
Multimodal = one model interface for multiple input types.
It does not remove the need for upload UX, preprocessing, validation, or human review.
Treat perception and reasoning as separate product risks.
1. What Multimodal Means
A multimodal model can take inputs beyond text — commonly images, audio, and sometimes video — and produce text (or other outputs) in one pass. Under the hood, non-text inputs are usually converted into token-like representations the transformer can attend over, alongside your prompt.
“Multimodal” is not one capability level. Vendors differ on:
- which file types are supported,
- max resolution and page count,
- whether audio is native or transcribed first,
- whether code is “seen” as files or only pasted snippets,
- pricing per image, per minute, or per token.
For PMs, the useful definition is operational: Can my product send this asset type to the model API in one call, and get a structured answer back? If yes, you still must ask whether that path is accurate and affordable at your volume.
| Term | PM meaning |
|---|---|
| Modality | A type of input or output (text, image, audio, video, code) |
| Native multimodal | Model API accepts the asset directly (e.g. image + prompt) |
| Pipeline multimodal | Specialized step converts asset to text, then LLM reasons |
| Perception | Reading what is in the file (text, layout, objects) |
| Reasoning | Interpreting meaning, rules, next actions |
2. Why PMs Should Care
Multimodal unlocks features that text-only chat cannot ship: “upload your bill and we explain the line items,” “summarize this call,” “review this UI screenshot,” “explain this repo.” It also expands the failure surface: privacy exposure, latency, cost, and wrong extractions that look authoritative.
| Product lever | Why multimodal matters |
|---|---|
| Time-to-value | Users already have PDFs and photos; typing it in is friction |
| Workflow fit | Ops, claims, legal, and support work in documents and scans |
| Unit economics | Images and long audio cost more tokens or separate metering |
| Trust | Wrong field extraction erodes trust faster than a vague chat reply |
| Compliance | PII in images/audio needs retention and access controls |
| Eval | You need golden files, not only golden prompts |
PM takeaway
Multimodal is a product shape decision, not a checkbox on the model picker. If the core job involves files, design the file path before you design the chat UI.
3. Modality Comparison — Strengths and Tradeoffs
Use this table in roadmap discussions — not as gospel, but to force explicit tradeoffs per feature.
| Modality | Typical strengths | Typical weaknesses | PM watchouts |
|---|---|---|---|
| Text | Cheap, fast, easy to log and diff | No layout; user must paste or OCR first | Context window still applies |
| Vision (images) | UI screenshots, photos, simple diagrams | Fine print, handwriting, dense tables | Resolution limits; hallucinated details |
| Documents (PDF) | End-to-end Q&A on mixed layouts | Multi-page cost; table drift | Versioning; need source highlighting |
| Audio | Meetings, calls, voice UX | Accents, overlap, compliance recording rules | Latency; diarization; transcript quality |
| Code | Explain, refactor suggestions, review | Large repos exceed context; may invent APIs | Run tests in CI; never merge without review |
| Video | Training, surveillance-style review (where allowed) | Very expensive; sparse frames miss events | Consent, retention, frame sampling strategy |
PM analogy
Multimodal models are like a generalist analyst who can glance at your attachments. Specialist OCR or speech engines are like dedicated scanners — narrower, sometimes more reliable on messy input.
4. Vision Use Cases — Images and Screenshots
Vision is strongest when the task is semantic understanding of what is visible, not when you need pixel-perfect transcription of a 40-column spreadsheet.
| Use case | What you ask the model | Product pattern |
|---|---|---|
| UI / UX review | “What is wrong with this checkout screen?” | Designer uploads screenshot; structured critique |
| Field ops | “Describe damage in this photo” | Mobile capture + categories + human confirm |
| Retail / catalog | “Identify product type from shelf photo” | Low-stakes suggestions; catalog lookup for SKU |
| Accessibility alt text | “Describe image for screen readers” | Human review for sensitive images |
| Chart / diagram explain | “Summarize trend in this chart” | Risk: wrong axis or legend; show image + answer |
| Identity / KYC (high risk) | Document photo classification | Often regulated — prefer specialist + fraud rules |
PM takeaway
For vision, ship confidence and “show your work” — display the image region or quote the visible label when possible. Do not hide the asset while showing only the model’s paraphrase.
5. Document Use Cases — PDFs, Scans, and Forms
Documents are where multimodal hype meets operations reality. Claims, invoices, contracts, and medical records mix text, tables, stamps, and handwriting.
| Use case | Desired output | Controls to plan |
|---|---|---|
| Intake summarization | Short summary + key fields | Field schema; missing-doc detection |
| Structured extraction | JSON: dates, amounts, diagnosis codes | Validation rules; cross-check against the system of record (SOR) |
| Policy / contract Q&A | Answer with clause reference | Versioned corpus; cite page/section |
| Comparison | Diff two PDF versions | Highlight changes; legal review gate |
| Redaction assist | Flag PII regions | Human approval before export |
| Search across repository | Find similar clauses | Often RAG + embeddings, not one-shot vision |
This connects to the grounding lessons from Module 01 (Why LLMs Hallucinate): extraction without citation is still a hallucination risk, just with a PDF attached.
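The "validation rules" control in the table above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the field names (`invoice_date`, `total`, `line_items`), the date format, and the `validate_extraction` function are all hypothetical, chosen only to show a required-field check, a format check, and an arithmetic check running on model output before anything reaches a downstream system.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical schema for illustration; a real product defines its own fields.
REQUIRED_FIELDS = ("invoice_date", "total", "line_items")


def validate_extraction(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if doc.get(field) in (None, "", [])
    ]
    if problems:
        return problems

    # Format check: this sketch assumes dates are normalized to YYYY-MM-DD.
    try:
        datetime.strptime(doc["invoice_date"], "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("invoice_date is not YYYY-MM-DD")

    # Arithmetic check: line items must add up to the stated total.
    try:
        total = Decimal(str(doc["total"]))
        items_sum = sum(Decimal(str(item["amount"])) for item in doc["line_items"])
    except (InvalidOperation, KeyError, TypeError):
        problems.append("amounts are not parseable numbers")
    else:
        if items_sum != total:
            problems.append(f"line items sum to {items_sum}, total says {total}")
    return problems
```

A non-empty result is a reason to route the document to review rather than to retry the model silently. `Decimal` is used instead of floats so that currency sums compare exactly.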
6. Audio Use Cases — Voice and Recordings
Audio products usually chain speech-to-text then LLM reasoning, unless the vendor offers a single native audio endpoint. PMs should spec both steps — transcript quality caps everything downstream.
| Use case | Output | Design notes |
|---|---|---|
| Call summarization | Summary + action items | Speaker labels; consent; retention policy |
| Support QA | Score empathy, compliance phrases | Human calibration; bias review |
| Meeting notes | Decisions and owners | Low temperature; link to calendar metadata |
| Voice assistant | Spoken reply | Latency budget; barge-in; failure prompts |
| Medical / legal (high risk) | Transcript + draft | Licensed workflows; never auto-finalize |
PM takeaway
If the transcript is wrong, the summary will be wrong with confidence. Budget for transcript review UI on high-stakes flows.
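One common way to put a number on transcript quality is word error rate (WER): the word-level edit distance between a human reference transcript and the machine transcript, divided by the reference length. The chapter does not prescribe a metric, so treat the sketch below as one reasonable option. The function name and the simple lowercase-and-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, kept to one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped by the transcript
                curr[j - 1] + 1,     # extra word inserted by the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing "refund the customer by friday" with "refund the customer on friday" gives one substitution over five reference words, a WER of 0.2. Tracking this on a small set of hand-corrected calls gives the transcript step its own quality bar, separate from the summary.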
7. Code Use Cases — Repos, Diffs, and Logs
Code is text structurally, but product-wise it behaves like a modality: large files, strict syntax, and verifiable truth via compilers and tests.
| Use case | What works well | What to avoid |
|---|---|---|
| Explain snippet | Small function + error message | Whole monorepo in one prompt |
| PR description draft | Diff summary | Auto-merge without CI |
| Test generation suggestions | Happy-path unit tests | Assuming tests pass without running |
| Log triage | Pattern explanation | Invented stack frames |
| Migration assist | Step plan + examples | One-shot “migrate entire codebase” |
| Internal DSL / configs | With retrieved docs (RAG) | Relying on training memory for private APIs |
PM takeaway
For code features, acceptance criteria should include tooling: linter, typecheck, unit tests, security scan. The model proposes; the pipeline proves.
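"The model proposes; the pipeline proves" can be read as a gate: a suggestion is accepted only if every automated check passes. The sketch below is an assumption about how such a gate might look. `gate_suggestion`, `CheckResult`, and `compiles` are hypothetical names, and the only check shown is Python's built-in `compile()` as a stand-in syntax check. A real pipeline would plug in its own linter, type checker, test runner, and security scan.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool


def compiles(source: str) -> bool:
    """Stand-in check: does the suggested Python source at least parse?"""
    try:
        compile(source, "<suggestion>", "exec")
        return True
    except SyntaxError:
        return False


def gate_suggestion(
    source: str, checks: dict[str, Callable[[str], bool]]
) -> tuple[bool, list[CheckResult]]:
    """Run every check; accept the suggestion only if all of them pass."""
    results = [CheckResult(name, bool(check(source))) for name, check in checks.items()]
    return all(r.passed for r in results), results


accepted, report = gate_suggestion("def add(a, b):\n    return a + b\n", {"syntax": compiles})
```

Returning the per-check report alongside the verdict lets the product show which check rejected a suggestion instead of a bare failure.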
8. Limitations — What Breaks in Production
Multimodal does not fix core LLM limits. It adds perception errors on top of reasoning errors.
| Limitation | What users see | Mitigation direction |
|---|---|---|
| Resolution / compression | Unreadable small text | Preprocess PDF to text layer; tile pages |
| Handwriting & stamps | Wrong characters | OCR specialist; human correction UI |
| Dense tables | Shifted columns, wrong totals | Table extraction tool; validate arithmetic |
| Charts & infographics | Wrong trend direction | Require numeric table source |
| Long files | Truncation, skipped pages | Chunking; page-range prompts; map-reduce |
| Cost at scale | Feature becomes unprofitable | Route small docs to cheaper tier; cache extracts |
| Latency | Timeouts on mobile upload | Async jobs; progress UI |
| Privacy | PII in logs | Redact; regional storage; retention TTL |
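The "chunking; page-range prompts; map-reduce" mitigation for long files can be sketched as follows. The code assumes page text has already been extracted, and `summarize` is a hypothetical callable standing in for whatever model call the product uses. No particular vendor API is implied.

```python
from typing import Callable


def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def map_reduce(pages: list[str], chunk_size: int, summarize: Callable[[str], str]) -> str:
    """Summarize each page range (map), then summarize the partial summaries (reduce)."""
    partials = []
    for start, end in page_ranges(len(pages), chunk_size):
        chunk_text = "\n".join(pages[start - 1:end])
        # Keep the page range attached so the final answer can cite its source.
        partials.append(f"[pages {start}-{end}] {summarize(chunk_text)}")
    return summarize("\n".join(partials))
```

For a 10-page file with a chunk size of 4, `page_ranges` yields `(1, 4)`, `(5, 8)`, and `(9, 10)`. Carrying the page range through each partial summary is what makes skipped pages visible rather than silent.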
PM takeaway
Write limitations into the PRD the same way you write latency SLOs. “Supports PDF” is not a requirement — “extracts these 12 fields at ≥X accuracy on golden set Y” is.
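A requirement of the form "extracts these fields at ≥X accuracy on golden set Y" implies a measurable check. The sketch below shows one way to compute it, assuming exact-match scoring and hypothetical document IDs and field names. Real evaluations often need normalization (dates, currency formats) before comparison.

```python
from collections import defaultdict


def field_accuracy(golden: dict[str, dict], predicted: dict[str, dict]) -> dict[str, float]:
    """Exact-match accuracy per field across all golden documents."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for doc_id, expected_fields in golden.items():
        got = predicted.get(doc_id, {})
        for field, expected in expected_fields.items():
            total[field] += 1
            # A missing document or field counts as wrong, not as skipped.
            if field in got and got[field] == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


def failing_fields(accuracy: dict[str, float], threshold: float) -> list[str]:
    """Fields whose accuracy falls below the PRD threshold."""
    return sorted(field for field, acc in accuracy.items() if acc < threshold)
```

Reporting accuracy per field rather than per document shows where the feature misses its bar: a single weak field can hide behind a healthy average.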
9. PM Workflows — From Demo to Production
A credible multimodal feature usually follows this workflow — adjust names to your org, keep the gates.
| Phase | PM activities | Exit criteria |
|---|---|---|
| Discover | List file types, fields, error cost, volume | Job story + risk tier |
| Prototype | 10–30 real files (redacted); compare native vs pipeline | Qualitative accuracy notes |
| Spec | Schema, confidence, human-in-the-loop (HITL) review, retention, fallbacks | PRD with eval rubric |
| Build | Upload, preprocessing, model route, validation | Golden-set regression in CI |
| Pilot | Shadow mode or reviewer queue | Override rate within bound |
| Scale | Cost dashboards, drift monitoring | Unit economics within target |
Example: claims document intake
- User uploads bill + discharge summary.
- System classifies doc types and runs extraction (pipeline or multimodal).
- Validator checks required fields and formats.
- Model drafts summary with citations to page/line.
- Processor reviews low-confidence fields in side-by-side UI.
- Approved fields write to claim system of record.
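The review step in the example above depends on splitting fields by confidence. A minimal sketch, assuming the extraction step returns a confidence score between 0 and 1 per field (not every model or OCR engine does) and using a hypothetical threshold of 0.9:

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in [0, 1]
    page: int          # citation back to the source page


def route_fields(
    fields: list[ExtractedField], threshold: float = 0.9
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into auto-accepted and needs-review lists."""
    accepted: list[ExtractedField] = []
    review: list[ExtractedField] = []
    for field in fields:
        (accepted if field.confidence >= threshold else review).append(field)
    return accepted, review


fields = [
    ExtractedField("patient_name", "A. Rao", 0.98, page=1),
    ExtractedField("total_amount", "4,210.00", 0.71, page=3),
]
accepted, review = route_fields(fields)
```

Here `total_amount` lands in the review list and `patient_name` is auto-accepted. The threshold is a product decision: it trades reviewer workload against the cost of a wrong field reaching the system of record, and it should be tuned against the override rate seen in the pilot.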
PM takeaway
The model step is often the shortest part of the timeline. Upload, validation, review, and audit usually determine whether the feature ships.
10. Design Checklist — Before You Ship
Use this checklist in design review. Missing items are how multimodal pilots die in production.
| Area | Question |
|---|---|
| Input | Which file types, max size, max pages, mobile capture quality? |
| Preprocessing | Rotate, deskew, PDF text layer, split pages? |
| Output | Free text vs JSON schema vs both? |
| Evidence | Can user see source snippet for each field? |
| Confidence | Per-field scores? What happens below threshold? |
| HITL | Who corrects errors? How fast must queue drain? |
| Failure | Timeout, corrupt file, unsupported language? |
| Security | Encryption, access roles, audit log? |
| Eval | Golden file set + regression on model/prompt change? |
| Economics | Cost per doc at P50 and P95 size? |
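The economics question (cost per doc at P50 and P95 size) can be answered from a sample of real page counts. The sketch below uses the nearest-rank percentile definition and a flat per-page price. Both are simplifying assumptions, and the function names are hypothetical. Actual vendor pricing may be per image, per minute, or per token.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values is empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_doc(pages: int, price_per_page: float, fixed_cost: float = 0.0) -> float:
    """Flat per-page pricing plus an optional fixed per-document cost."""
    return fixed_cost + pages * price_per_page


# Page counts from a hypothetical sample of uploaded documents.
sample_pages = [1, 2, 2, 3, 3, 4, 5, 8, 12, 40]
p50_pages = percentile(sample_pages, 50)
p95_pages = percentile(sample_pages, 95)
```

In this sample the median document has 3 pages while the P95 document has 40, so the tail costs more than ten times the median under per-page pricing. That gap is why the checklist asks for both numbers rather than an average.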
11. Multimodal vs OCR — When to Use Which
OCR (optical character recognition) and layout parsers extract text and structure first; the LLM reasons on text. Native multimodal sends pixels or pages to the model directly.
| Factor | Lean toward OCR / layout pipeline | Lean toward native multimodal |
|---|---|---|
| Need exact field strings for billing | Yes — validate each field | Risky alone |
| Messy scans, stamps, handwriting | Specialist OCR + human correct | May help as assist, not sole path |
| Semantic Q&A on narrative PDF | After text extract + RAG | Strong for exploratory Q&A |
| UI screenshot understanding | Weak | Strong |
| Cost predictability | Often cheaper at high volume | Per-image/token can spike |
| Debuggability | Intermediate text is inspectable | Harder to audit “what it saw” |
| Latency | Two-step can be slower | One call can be simpler |
Common winning pattern: OCR or layout extraction for fields + multimodal or LLM for summary and classification — with one golden eval set covering the full chain.
PM takeaway
Do not treat “multimodal” and “OCR” as rivals. Treat them as layers. Pick extraction fidelity first, reasoning second.
Chapter Summary
| Concept | PM understanding |
|---|---|
| Multimodal | One model interface for multiple input types — not automatic accuracy |
| Perception vs reasoning | Reading the file and interpreting it are separate failure modes |
| Vision | Strong for screenshots and semantic image tasks; weak on fine print |
| Documents | High value; needs schema, citation, validation |
| Audio | Transcript quality dominates; compliance matters |
| Code | Propose in LLM; prove in CI |
| Workflow | Upload → extract → validate → review → system of record (SOR) |
| vs OCR | Often combine pipeline extraction with LLM reasoning |
Closing Thought
Multimodal is why many AI products finally feel “real” to business users — the model meets them where their work already lives, in files and recordings.
It is also where demos diverge fastest from production. A slick upload button does not replace field validation, source highlighting, or reviewer queues.
As a PM, your job is to name the modality, name the acceptable error rate, and design the human and system checks around perception — before you argue about which logo is on the model card.
Next we look at benchmarks — what public scores actually measure, what they miss, and how to build evaluation your stakeholders should trust more than a leaderboard screenshot.
The real PM lesson
Ingest → Perceive → Reason → Verify. Skip Verify and multimodal becomes an expensive guessing machine.