Lesson 18: Vision & Document Analysis
What Claude Can See
Claude's vision capabilities let you pass images and documents alongside text prompts. Claude can read, analyze, and reason about visual content — not just describe it, but draw conclusions, extract data, and answer specific questions about what it sees.
This unlocks a category of tasks that pure text models cannot handle: reviewing a UI screenshot, extracting numbers from a chart, reading a scanned receipt, or analyzing a diagram.
Supported Formats
| Format | Notes |
|---|---|
| JPEG | Best for photos, screenshots |
| PNG | Best for diagrams, UI, screenshots with text |
| GIF | Static only — first frame is used |
| WebP | Supported, good compression |
| PDF | Analyzed page-by-page or as a whole |
Size limits: up to 5 MB per image. PDF limits depend on page count and file size; check the current API documentation for exact values.
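A quick pre-flight check can catch oversized or unsupported files before you spend a request on them. The sketch below is illustrative only: `check_image_file` and the two constants are hypothetical names, and it assumes the 5 MB limit applies to the raw file size on disk.

```python
from pathlib import Path

# Assumption: the 5 MB limit is measured against the raw file size on disk.
MAX_IMAGE_BYTES = 5 * 1024 * 1024
SUPPORTED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def check_image_file(image_path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks sendable."""
    path = Path(image_path)
    if not path.is_file():
        return [f"{path} does not exist or is not a file"]

    problems = []
    if path.suffix.lower() not in SUPPORTED_IMAGE_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")

    size = path.stat().st_size
    if size > MAX_IMAGE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_IMAGE_BYTES}-byte limit")
    return problems
```

Returning a list of problems rather than raising lets a batch job log every rejected file and carry on with the rest.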
Sending Images via Base64
For images stored locally or fetched from non-public URLs, encode them as base64:
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_image(image_path: str, prompt: str) -> str:
    image_data = base64.standard_b64encode(Path(image_path).read_bytes()).decode("utf-8")

    # Detect media type from extension
    ext = Path(image_path).suffix.lower()
    media_type_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                      ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
    media_type = media_type_map.get(ext, "image/png")

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }]
    )
    return response.content[0].text
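The extension lookup above trusts the file name and falls back to `image/png` for anything it does not recognize. If your files may be misnamed, one alternative is to inspect the leading bytes instead. This is a minimal sketch using the well-known file signatures of the four supported formats; `sniff_image_media_type` is a hypothetical helper, not part of the SDK.

```python
def sniff_image_media_type(data: bytes) -> str:
    """Guess the image media type from the leading bytes of the file."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", a 4-byte length, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("not a JPEG, PNG, GIF, or WebP image")
```

Raising on unknown content surfaces a bad file locally instead of sending it with a guessed media type.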
Sending Images via URL
For publicly accessible images, pass the URL directly — no base64 encoding needed:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/architecture-diagram.png"
                }
            },
            {
                "type": "text",
                "text": "Identify any single points of failure in this architecture diagram."
            }
        ]
    }]
)
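When a pipeline handles both public URLs and local files, it can help to pick the source type in one place. The sketch below is one way to do that; `image_source` is a hypothetical helper, and treating any `http` or `https` location as publicly reachable is an assumption you should verify for your own inputs.

```python
import base64
from pathlib import Path
from urllib.parse import urlparse


def image_source(location: str, media_type: str = "image/png") -> dict:
    """Build the "source" dict for an image block from a URL or a local path."""
    scheme = urlparse(location).scheme.lower()
    # Assumption: http(s) locations are publicly reachable and can be passed as-is.
    if scheme in ("http", "https"):
        return {"type": "url", "url": location}

    # Anything else is treated as a local file and base64-encoded.
    data = base64.standard_b64encode(Path(location).read_bytes()).decode("utf-8")
    return {"type": "base64", "media_type": media_type, "data": data}
```

The returned dict slots into `{"type": "image", "source": ...}` in the message content.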
PDF Document Analysis
PDFs can be sent as base64-encoded documents. Claude reads the full document and can answer questions, extract information, or summarize across pages:
pdf_data = base64.standard_b64encode(Path("report.pdf").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                }
            },
            {
                "type": "text",
                "text": "Extract all financial figures mentioned and list them in a table with: metric name, value, and the page number where it appears."
            }
        ]
    }]
)
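If you send PDFs from several places in your code, building the document block in a small helper keeps the request shape consistent. The sketch below also rejects files that do not start with the standard `%PDF-` header. `pdf_document_block` is a hypothetical name, and the header check is a cheap sanity test, not full validation.

```python
import base64
from pathlib import Path


def pdf_document_block(pdf_path: str) -> dict:
    """Read a PDF and return a base64 "document" content block for it."""
    raw = Path(pdf_path).read_bytes()
    # A well-formed PDF begins with the "%PDF-" marker.
    if not raw.startswith(b"%PDF-"):
        raise ValueError(f"{pdf_path} does not look like a PDF")

    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(raw).decode("utf-8"),
        },
    }
```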
Practical Use Cases
UI Screenshot Review
feedback = analyze_image(
    "screenshot.png",
    "Review this UI for usability issues. For each issue: identify the element, "
    "describe the problem, and suggest a fix. Format as a numbered list."
)
Chart and Graph Extraction
data = analyze_image(
    "quarterly-chart.png",
    "Extract the data from this bar chart. Return a JSON array where each entry "
    "has 'label' and 'value'. If you cannot read an exact value, provide an estimate "
    "and flag it with 'approximate': true."
)
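The chart prompt above returns text, so the JSON still needs to be parsed before you can use it. The sketch below assumes the reply is either bare JSON or JSON wrapped in a Markdown code fence, which is common but not guaranteed. `parse_chart_reply` is a hypothetical helper.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def parse_chart_reply(reply_text: str) -> tuple[list[dict], list[dict]]:
    """Split a JSON-array reply into (exact, approximate) entries."""
    # Assumption: if a code fence is present, the JSON is inside the first one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply_text, re.DOTALL)
    payload = match.group(1) if match else reply_text

    entries = json.loads(payload)  # raises json.JSONDecodeError on malformed replies
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array")

    exact = [e for e in entries if not e.get("approximate")]
    approximate = [e for e in entries if e.get("approximate")]
    return exact, approximate
```

Keeping the approximate entries separate makes it easy to route them to a human check rather than mixing estimates with exact readings.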
Receipt Processing
receipt = analyze_image(
    "receipt.jpg",
    "Extract: merchant name, date, line items (name + price), subtotal, tax, total. "
    "Return as JSON. If any field is not visible, set it to null."
)
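Extracted receipt data is worth cross-checking arithmetically, since a misread digit usually breaks the totals. The sketch below is illustrative: the key names (`line_items`, `price`, `subtotal`, `tax`, `total`) are assumptions, because the prompt above does not pin down exact keys, and prices are assumed to come back as numbers.

```python
import json
import math


def check_receipt_totals(receipt_json: str, tolerance: float = 0.01) -> list[str]:
    """Return warnings about totals that do not add up; empty means consistent."""
    receipt = json.loads(receipt_json)
    warnings = []

    # Hypothetical key names; match them to whatever your prompt requests.
    prices = [item.get("price") for item in receipt.get("line_items") or []]
    subtotal = receipt.get("subtotal")
    tax = receipt.get("tax")
    total = receipt.get("total")

    # Only compare values that were actually read; null fields are skipped.
    if prices and subtotal is not None and all(p is not None for p in prices):
        if not math.isclose(sum(prices), subtotal, abs_tol=tolerance):
            warnings.append("line items do not sum to subtotal")
    if subtotal is not None and tax is not None and total is not None:
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            warnings.append("subtotal plus tax does not equal total")
    return warnings
```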
Code Screenshot Analysis
code_review = analyze_image(
    "code-screenshot.png",
    "What language is this? Identify any bugs or issues you can see. "
    "Transcribe the code as text."
)
Multiple Images in One Request
You can include multiple images in a single message. Claude reasons across all of them:
messages=[{
    "role": "user",
    "content": [
        {"type": "image", "source": {"type": "url", "url": before_url}},
        {"type": "image", "source": {"type": "url", "url": after_url}},
        {"type": "text", "text": "What changed between the before (first image) and after (second image)?"}
    ]
}]
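With more than two images, positional references like "first" and "second" become hard to follow. One option is to put a short text label before each image so the prompt can refer to them by number. A minimal sketch, where `labeled_image_content` is a hypothetical helper and the "Image N:" wording is just a convention:

```python
def labeled_image_content(image_urls: list[str], question: str) -> list[dict]:
    """Build a content list with an "Image N:" text label before each image."""
    content = []
    for index, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({"type": "image", "source": {"type": "url", "url": url}})
    # The question goes last so it can refer to the labels above it.
    content.append({"type": "text", "text": question})
    return content
```

The result can be passed as the `"content"` value of a user message.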
Writing Effective Vision Prompts
Be as specific with vision prompts as you are with text prompts. Claude will answer what you ask — not what you hoped for.
# Weak
"What do you see in this image?"
# Strong
"This is a screenshot of a React component. Identify:
1. Any accessibility issues (missing alt text, poor contrast, no focus indicators)
2. Any layout issues visible at this viewport size
3. Any text that appears to be truncated or overflowing
List each issue separately."
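If you reuse the same structure across many screenshots, the strong prompt can be assembled from its parts: a line of context, a numbered checklist, and a closing instruction. A small sketch, with `build_vision_prompt` as a hypothetical helper:

```python
def build_vision_prompt(
    context: str,
    checks: list[str],
    closing: str = "List each issue separately.",
) -> str:
    """Assemble a structured prompt: context, numbered checks, closing instruction."""
    numbered = [f"{i}. {check}" for i, check in enumerate(checks, start=1)]
    return "\n".join([f"{context} Identify:", *numbered, closing])


prompt = build_vision_prompt(
    "This is a screenshot of a React component.",
    [
        "Any accessibility issues (missing alt text, poor contrast, no focus indicators)",
        "Any layout issues visible at this viewport size",
        "Any text that appears to be truncated or overflowing",
    ],
)
```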
Limitations
- No image generation — Claude can analyze images, not create them
- No real-time video — Video files are not supported; only static frames
- Accuracy varies with quality — Blurry, low-contrast, or heavily compressed images produce less reliable results
- Handwriting is harder — Printed text is read well; handwriting is less reliable
- Token costs for images — Each image consumes tokens; large images at high resolution cost more
Image Token Costs
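Images are billed based on their dimensions. A rough rule of thumb is width × height / 750 tokens; very large images may be downscaled by the API before they are counted, so treat large figures as upper bounds. Approximate costs:
| Image Size | Approximate Tokens |
|---|---|
| 200×200 px | ~54 tokens |
| 1000×1000 px | ~1,334 tokens |
| 2000×2000 px | ~5,334 tokens |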
For batch processing many images, consider resizing to the minimum resolution that preserves the information you need.
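To budget a batch job, you can estimate token cost from pixel dimensions and work out a smaller target size before resizing with your image library of choice. The sketch below assumes the rough width × height / 750 rule of thumb, so check the current documentation for the exact accounting. Both function names are hypothetical.

```python
import math

# Assumption: rule-of-thumb divisor; the API's exact accounting may differ.
PIXELS_PER_TOKEN = 750


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for an image of the given pixel dimensions."""
    return math.ceil(width * height / PIXELS_PER_TOKEN)


def fit_within(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Scale dimensions down so the long edge is at most max_edge, keeping aspect ratio."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, `fit_within(2000, 1000, 1000)` gives the target size to pass to a resize call, and `estimate_image_tokens` on that result shows the saving.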
Key Takeaways
- Claude supports JPEG, PNG, GIF, WebP for images and PDF for documents
- Send images as base64 (for local files) or URL (for public images)
- PDFs can be analyzed page-by-page or as a whole document
- Multiple images can be included in a single message for comparison tasks
- Write specific, structured vision prompts — vagueness produces vague visual analysis
- Images cost tokens proportional to their resolution; resize for batch cost control
- Claude cannot generate images or process video — analysis only