Multimodal AI: Vision, Audio, Code, and Documents — The PM Version

Introduction

In Chapter 2, we compared model families — GPT, Claude, Gemini, Llama — and how to pick a provider for a use case.

Most real products are not text-only. Users upload PDFs, photos of forms, call recordings, screenshots, spreadsheets, and code repos. “Multimodal” is the label vendors use when a model can accept more than plain text in a single request.

That sounds simple. In production it is not. The same model that summarizes a clean PDF may misread a stamped invoice, skip a chart legend, or confidently describe an image detail that is not there.

This chapter is the PM map: what each modality is good for, where it breaks, how workflows should look, and when a dedicated OCR or speech pipeline beats “just send it to the vision model.”

The simple PM version

Multimodal = one model interface for multiple input types.
It does not remove the need for upload UX, preprocessing, validation, or human review.
Treat perception and reasoning as separate product risks.

1. What Multimodal Means

A multimodal model can take inputs beyond text — commonly images, audio, and sometimes video — and produce text (or other outputs) in one pass. Under the hood, non-text inputs are usually converted into token-like representations the transformer can attend over, alongside your prompt.

“Multimodal” is not one capability level. Vendors differ on:

which file types are supported,
max resolution and page count,
whether audio is native or transcribed first,
whether code is “seen” as files or only pasted snippets,
pricing per image, per minute, or per token.

For PMs, the useful definition is operational: Can my product send this asset type to the model API in one call, and get a structured answer back? If yes, you still must ask whether that path is accurate and affordable at your volume.

Term	PM meaning
Modality	A type of input or output (text, image, audio, video, code)
Native multimodal	Model API accepts the asset directly (e.g. image + prompt)
Pipeline multimodal	Specialized step converts asset to text, then LLM reasons
Perception	Reading what is in the file (text, layout, objects)
Reasoning	Interpreting meaning, rules, next actions

2. Why PMs Should Care

Multimodal unlocks features that text-only chat cannot ship: “upload your bill and we explain the line items,” “summarize this call,” “review this UI screenshot,” “explain this repo.” It also expands the failure surface: privacy, latency, cost, and wrong extractions that look authoritative.

Product lever	Why multimodal matters
Time-to-value	Users already have PDFs and photos; typing it in is friction
Workflow fit	Ops, claims, legal, and support work in documents and scans
Unit economics	Images and long audio cost more tokens or separate metering
Trust	Wrong field extraction erodes trust faster than a vague chat reply
Compliance	PII in images/audio needs retention and access controls
Eval	You need golden files, not only golden prompts

PM takeaway

Multimodal is a product shape decision, not a checkbox on the model picker. If the core job involves files, design the file path before you design the chat UI.

3. Modality Comparison — Strengths and Tradeoffs

Use this table in roadmap discussions — not as gospel, but to force explicit tradeoffs per feature.

Modality	Typical strengths	Typical weaknesses	PM watchouts
Text	Cheap, fast, easy to log and diff	No layout; user must paste or OCR first	Context window still applies
Vision (images)	UI screenshots, photos, simple diagrams	Fine print, handwriting, dense tables	Resolution limits; hallucinated details
Documents (PDF)	End-to-end Q&A on mixed layouts	Multi-page cost; table drift	Versioning; need source highlighting
Audio	Meetings, calls, voice UX	Accents, overlap, compliance recording rules	Latency; diarization; transcript quality
Code	Explain, refactor suggestions, review	Large repos exceed context; may invent APIs	Run tests in CI; never merge without review
Video	Training, surveillance-style review (where allowed)	Very expensive; sparse frames miss events	Consent, retention, frame sampling strategy

PM analogy

Multimodal models are like a generalist analyst who can glance at your attachments. Specialist OCR or speech engines are like dedicated scanners — narrower, sometimes more reliable on messy input.

4. Vision Use Cases — Images and Screenshots

Vision is strongest when the task is semantic understanding of what is visible, not when you need pixel-perfect transcription of a 40-column spreadsheet.

Use case	What you ask the model	Product pattern
UI / UX review	“What is wrong with this checkout screen?”	Designer uploads screenshot; structured critique
Field ops	“Describe damage in this photo”	Mobile capture + categories + human confirm
Retail / catalog	“Identify product type from shelf photo”	Low-stakes suggestions; catalog lookup for SKU
Accessibility alt text	“Describe image for screen readers”	Human review for sensitive images
Chart / diagram explain	“Summarize trend in this chart”	Risk: wrong axis or legend; show image + answer
Identity / KYC (high risk)	Document photo classification	Often regulated — prefer specialist + fraud rules

PM takeaway

For vision, ship confidence and “show your work” — display the image region or quote the visible label when possible. Do not hide the asset while showing only the model’s paraphrase.

5. Document Use Cases — PDFs, Scans, and Forms

Documents are where multimodal hype meets operations reality. Claims, invoices, contracts, and medical records mix text, tables, stamps, and handwriting.

Use case	Desired output	Controls to plan
Intake summarization	Short summary + key fields	Field schema; missing-doc detection
Structured extraction	JSON: dates, amounts, diagnosis codes	Validation rules; system-of-record (SOR) cross-check
Policy / contract Q&A	Answer with clause reference	Versioned corpus; cite page/section
Comparison	Diff two PDF versions	Highlight changes; legal review gate
Redaction assist	Flag PII regions	Human approval before export
Search across repository	Find similar clauses	Often RAG + embeddings, not one-shot vision

This connects to the grounding lessons from Module 01 (Why LLMs Hallucinate): extraction without citation still carries hallucination risk, just with a PDF attached.
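The "validation rules" control from the table above can be made concrete with a small check that runs on the model's JSON before anything reaches a downstream system. This is a minimal sketch, not a real claims schema: the field names, the `SCHEMA` mapping, and the simplified diagnosis-code pattern are all assumptions chosen for illustration.

```python
import re
from datetime import date


def is_iso_date(value):
    """True when value is a string that parses as an ISO calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def is_amount(value):
    """True for strings such as '120.50' (digits, a dot, two decimals)."""
    return isinstance(value, str) and re.fullmatch(r"\d+\.\d{2}", value) is not None


def is_diagnosis_code(value):
    """Simplified, illustrative pattern; not a full medical-code validator."""
    return isinstance(value, str) and re.fullmatch(r"[A-Z]\d{2}(\.\d{1,4})?", value) is not None


# Hypothetical schema: field name -> format check.
SCHEMA = {
    "service_date": is_iso_date,
    "total_amount": is_amount,
    "diagnosis_code": is_diagnosis_code,
}


def validate_extraction(extracted):
    """Return {field: problem} for every missing or malformed field."""
    problems = {}
    for field, check in SCHEMA.items():
        value = extracted.get(field)
        if value is None or value == "":
            problems[field] = "missing"
        elif not check(value):
            problems[field] = "bad format"
    return problems
```

An empty result means the extraction passed the format checks, not that the values are true. A system-of-record cross-check and source citations are still needed to catch values that are well-formed but wrong.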

6. Audio Use Cases — Voice and Recordings

Audio products usually chain speech-to-text then LLM reasoning, unless the vendor offers a single native audio endpoint. PMs should spec both steps — transcript quality caps everything downstream.

Use case	Output	Design notes
Call summarization	Summary + action items	Speaker labels; consent; retention policy
Support QA	Score empathy, compliance phrases	Human calibration; bias review
Meeting notes	Decisions and owners	Low temperature; link to calendar metadata
Voice assistant	Spoken reply	Latency budget; barge-in; failure prompts
Medical / legal (high risk)	Transcript + draft	Licensed workflows; never auto-finalize

PM takeaway

If the transcript is wrong, the summary will be wrong with confidence. Budget for transcript review UI on high-stakes flows.
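The two-step chain described in this section can be sketched as a function that refuses to summarize when the transcript looks unreliable. Everything here is an assumption for illustration: `transcribe` and `summarize` stand in for whatever speech-to-text and LLM calls your stack uses, and the 0.85 threshold and the idea of a single 0.0 to 1.0 confidence score are placeholders to be tuned against real recordings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed 0.0-1.0 score reported by the speech-to-text step


@dataclass
class CallResult:
    transcript: Transcript
    summary: Optional[str]
    needs_transcript_review: bool


def process_call(
    audio: bytes,
    transcribe: Callable[[bytes], Transcript],  # hypothetical speech-to-text step
    summarize: Callable[[str], str],            # hypothetical LLM summarization step
    min_confidence: float = 0.85,
) -> CallResult:
    """Transcribe first; only summarize when the transcript clears the bar."""
    transcript = transcribe(audio)
    if transcript.confidence < min_confidence:
        # Do not build a confident summary on a shaky transcript.
        return CallResult(transcript, None, True)
    return CallResult(transcript, summarize(transcript.text), False)
```

Keeping the transcript in the result, even when a summary exists, is what makes a transcript review UI possible later.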

7. Code Use Cases — Repos, Diffs, and Logs

Code is text structurally, but product-wise it behaves like a modality: large files, strict syntax, and verifiable truth via compilers and tests.

Use case	What works well	What to avoid
Explain snippet	Small function + error message	Whole monorepo in one prompt
PR description draft	Diff summary	Auto-merge without CI
Test generation suggestions	Happy-path unit tests	Assuming tests pass without running them
Log triage	Pattern explanation	Invented stack frames
Migration assist	Step plan + examples	One-shot “migrate entire codebase”
Internal DSL / configs	With retrieved docs (RAG)	Relying on training memory for private APIs

PM takeaway

For code features, acceptance criteria should include tooling: linter, typecheck, unit tests, security scan. The model proposes; the pipeline proves.
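"The model proposes; the pipeline proves" can be sketched as a gate that runs every check and blocks the merge if any of them fails. This is a rough illustration, not a CI system: the function names are made up here, and the commands in the commented example are just common Python tooling that your repo may or may not use.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, argv) check and return the names of the ones that failed."""
    failed = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:  # non-zero exit code means the check failed
            failed.append(name)
    return failed


def can_merge(checks):
    """A model-written change is mergeable only when every check passes."""
    return not run_checks(checks)


# Example wiring (assumed tooling; substitute whatever your repo already runs):
# checks = [
#     ("lint", ["ruff", "check", "."]),
#     ("types", ["mypy", "."]),
#     ("tests", ["pytest", "-q"]),
# ]
# print("mergeable:", can_merge(checks))
```

A passing gate is a floor, not a ceiling: human review still applies, as the "what to avoid" column above says.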

8. Limitations — What Breaks in Production

Multimodal does not fix core LLM limits. It adds perception errors on top of reasoning errors.

Limitation	What users see	Mitigation direction
Resolution / compression	Unreadable small text	Preprocess PDF to text layer; tile pages
Handwriting & stamps	Wrong characters	OCR specialist; human correction UI
Dense tables	Shifted columns, wrong totals	Table extraction tool; validate arithmetic
Charts & infographics	Wrong trend direction	Require numeric table source
Long files	Truncation, skipped pages	Chunking; page-range prompts; map-reduce
Cost at scale	Feature becomes unprofitable	Route small docs to cheaper tier; cache extracts
Latency	Timeouts on mobile upload	Async jobs; progress UI
Privacy	PII in logs	Redact; regional storage; retention TTL
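The "validate arithmetic" mitigation for dense tables is one of the cheapest checks to build: recompute the total from the extracted line items and compare it with the total the document states. A minimal sketch, assuming amounts arrive as decimal strings and that a one-cent tolerance is acceptable (both are assumptions to adjust):

```python
from decimal import Decimal


def totals_match(line_items, stated_total, tolerance=Decimal("0.01")):
    """True when the extracted line items add up to the stated total.

    line_items and stated_total are decimal strings such as "10.00".
    Decimal avoids the rounding surprises of binary floats on money.
    """
    computed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return abs(computed - Decimal(stated_total)) <= tolerance
```

A mismatch does not say which cell is wrong, only that the table should go to a reviewer instead of straight into the record.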

PM takeaway

Write limitations into the PRD the same way you write latency SLOs. “Supports PDF” is not a requirement — “extracts these 12 fields at ≥X accuracy on golden set Y” is.
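A requirement phrased as "these fields at ≥X accuracy on golden set Y" translates almost directly into a regression check. The sketch below assumes exact-match scoring and a simple dictionary layout for the golden set; both are illustrative choices, and real evaluations often normalize values (dates, currency formats) before comparing.

```python
def field_accuracy(golden, predicted, fields):
    """Per-field exact-match accuracy.

    golden and predicted map doc_id -> {field: value}.
    A document missing from predicted counts as wrong for every field.
    """
    if not golden:
        raise ValueError("golden set is empty")
    scores = {}
    for field in fields:
        correct = sum(
            1
            for doc_id, truth in golden.items()
            if predicted.get(doc_id, {}).get(field) == truth.get(field)
        )
        scores[field] = correct / len(golden)
    return scores


def meets_bar(scores, threshold):
    """True only when every field reaches the required accuracy."""
    return all(score >= threshold for score in scores.values())
```

Reporting accuracy per field rather than one blended number matters: a 95% average can hide a billing field that is right only half the time.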

9. PM Workflows — From Demo to Production

A credible multimodal feature usually follows this workflow — adjust names to your org, keep the gates.

Phase	PM activities	Exit criteria
Discover	List file types, fields, error cost, volume	Job story + risk tier
Prototype	10–30 real files (redacted); compare native vs pipeline	Qualitative accuracy notes
Spec	Schema, confidence, human-in-the-loop (HITL) review, retention, fallbacks	PRD with eval rubric
Build	Upload, preprocessing, model route, validation	Golden-set regression in CI
Pilot	Shadow mode or reviewer queue	Override rate within bound
Scale	Cost dashboards, drift monitoring	Unit economics within target

Example: claims document intake

User uploads bill + discharge summary.
System classifies doc types and runs extraction (pipeline or multimodal).
Validator checks required fields and formats.
Model drafts summary with citations to page/line.
Processor reviews low-confidence fields in side-by-side UI.
Approved fields write to claim system of record.
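The review step in this example depends on splitting fields by confidence. One way that split might look, assuming the extraction step returns a value and a 0.0 to 1.0 confidence score per field (many setups do not provide this out of the box, so treat the input shape and the 0.9 threshold as assumptions):

```python
def route_fields(fields, threshold=0.9):
    """Split extracted fields into auto-accepted and needs-review.

    fields maps name -> (value, confidence).
    Returns (accepted, review), each mapping name -> value.
    """
    accepted, review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else review
        target[name] = value
    return accepted, review
```

For example, `route_fields({"total": ("120.50", 0.98), "date": ("2024-03-05", 0.62)})` accepts the total and sends the date to the processor's side-by-side queue.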

PM takeaway

The model step is often the shortest part of the timeline. Upload, validation, review, and audit usually determine whether the feature ships.

10. Design Checklist — Before You Ship

Use this checklist in design review. Missing items are how multimodal pilots die in production.

Area	Question
Input	Which file types, max size, max pages, mobile capture quality?
Preprocessing	Rotate, deskew, PDF text layer, split pages?
Output	Free text vs JSON schema vs both?
Evidence	Can user see source snippet for each field?
Confidence	Per-field scores? What happens below threshold?
HITL	Who corrects errors? How fast must the queue drain?
Failure	Timeout, corrupt file, unsupported language?
Security	Encryption, access roles, audit log?
Eval	Golden file set + regression on model/prompt change?
Economics	Cost per doc at P50 and P95 size?
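The economics question is easier to answer with a few lines of arithmetic over real upload sizes. The sketch below assumes a flat price per page expressed in integer cents, which is a simplification: as noted earlier, vendors may meter per image, per minute, or per token. It uses the nearest-rank definition of a percentile.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def cost_profile(page_counts, cents_per_page):
    """Cost per document, in cents, at the median and the 95th percentile."""
    costs = sorted(pages * cents_per_page for pages in page_counts)
    return {"p50": percentile(costs, 50), "p95": percentile(costs, 95)}
```

Looking at P95 alongside P50 is the point: a handful of 200-page uploads can dominate spend even when the typical document is cheap.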

11. Multimodal vs OCR — When to Use Which

OCR (optical character recognition) and layout parsers extract text and structure first; the LLM reasons on text. Native multimodal sends pixels or pages to the model directly.

Factor	Lean toward OCR / layout pipeline	Lean toward native multimodal
Need exact field strings for billing	Yes — validate each field	Risky alone
Messy scans, stamps, handwriting	Specialist OCR + human correction	May help as assist, not sole path
Semantic Q&A on narrative PDF	After text extract + RAG	Strong for exploratory Q&A
UI screenshot understanding	Weak	Strong
Cost predictability	Often cheaper at high volume	Per-image/token can spike
Debuggability	Intermediate text is inspectable	Harder to audit “what it saw”
Latency	Two-step can be slower	One call can be simpler

Common winning pattern: OCR or layout extraction for fields + multimodal or LLM for summary and classification — with one golden eval set covering the full chain.

PM takeaway

Do not treat “multimodal” and “OCR” as rivals. Treat them as layers. Pick extraction fidelity first, reasoning second.

Chapter Summary

Concept	PM understanding
Multimodal	One model interface for multiple input types — not automatic accuracy
Perception vs reasoning	Reading the file and interpreting it are separate failure modes
Vision	Strong for screenshots and semantic image tasks; weak on fine print
Documents	High value; needs schema, citation, validation
Audio	Transcript quality dominates; compliance matters
Code	Propose in LLM; prove in CI
Workflow	Upload → extract → validate → review → SOR
vs OCR	Often combine pipeline extraction with LLM reasoning

Closing Thought

Multimodal is why many AI products finally feel “real” to business users — the model meets them where their work already lives, in files and recordings.

It is also where demos diverge fastest from production. A slick upload button does not replace field validation, source highlighting, or reviewer queues.

As a PM, your job is to name the modality, name the acceptable error rate, and design the human and system checks around perception — before you argue about which logo is on the model card.

Next we look at benchmarks — what public scores actually measure, what they miss, and how to build evaluation your stakeholders should trust more than a leaderboard screenshot.

The real PM lesson

Ingest → Perceive → Reason → Verify. Skip Verify and multimodal becomes an expensive guessing machine.
