Lesson 18: Vision & Document Analysis

What Claude Can See

Claude's vision capabilities let you pass images and documents alongside text prompts. Claude can read, analyze, and reason about visual content — not just describe it, but draw conclusions, extract data, and answer specific questions about what it sees.

This unlocks a category of tasks that pure text models cannot handle: reviewing a UI screenshot, extracting numbers from a chart, reading a scanned receipt, or analyzing a diagram.

Supported Formats

Format	Notes
JPEG	Best for photos, screenshots
PNG	Best for diagrams, UI, screenshots with text
GIF	Static only — first frame is used
WebP	Supported, good compression
PDF	Analyzed page-by-page or as a whole

Size limits: Each image can be up to 5 MB. PDF limits depend on page count and file size; check the current API documentation for the exact values.
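Checking file size locally before uploading avoids a wasted request. The sketch below is one way to do that. The helper name `check_image_size` is hypothetical, and the constant assumes the 5 MB figure above means 5 × 1024 × 1024 bytes, so confirm it against the current documentation.

```python
from pathlib import Path

# Assumption: "5 MB" from the lesson text interpreted as 5 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 5 * 1024 * 1024


def check_image_size(image_path: str, limit: int = MAX_IMAGE_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds the limit."""
    size = Path(image_path).stat().st_size
    if size > limit:
        raise ValueError(f"{image_path} is {size} bytes; the limit is {limit} bytes")
    return size
```

Calling `check_image_size("screenshot.png")` before encoding the file gives an early, readable error instead of a failed API call.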

Sending Images via Base64

For images stored locally or fetched from non-public URLs, encode them as base64:

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_image(image_path: str, prompt: str) -> str:
    image_data = base64.standard_b64encode(Path(image_path).read_bytes()).decode("utf-8")
    
    # Detect media type from extension; fail loudly instead of guessing PNG
    ext = Path(image_path).suffix.lower()
    media_type_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                      ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
    if ext not in media_type_map:
        raise ValueError(f"Unsupported image extension: {ext!r}")
    media_type = media_type_map[ext]
    
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }]
    )
    return response.content[0].text
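
The function above returns `response.content[0].text`, which assumes the first content block is a text block. A slightly more defensive sketch joins every text block instead. The helper name `response_text` is hypothetical, and the `SimpleNamespace` objects only stand in for real response blocks.

```python
from types import SimpleNamespace


def response_text(content_blocks) -> str:
    """Join the text of every block whose type is "text", skipping other block types."""
    return "".join(
        block.text for block in content_blocks
        if getattr(block, "type", None) == "text"
    )


# Stand-in objects for illustration; in real use, pass response.content.
fake_content = [SimpleNamespace(type="text", text="Two issues found.")]
print(response_text(fake_content))
```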

Sending Images via URL

For publicly accessible images, pass the URL directly — no base64 encoding needed:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/architecture-diagram.png"
                }
            },
            {
                "type": "text",
                "text": "Identify any single points of failure in this architecture diagram."
            }
        ]
    }]
)
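
If a pipeline handles both public URLs and local files, one helper can pick the source type. This is a minimal sketch: `image_block` and `MEDIA_TYPES` are hypothetical names, and treating any string that starts with `http://` or `https://` as a public URL is an assumption.

```python
import base64
from pathlib import Path

# Extension-to-media-type map for the image formats listed in this lesson.
MEDIA_TYPES = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp",
}


def image_block(ref: str) -> dict:
    """Build an image content block from a URL or a local file path."""
    # Assumption: anything that looks like an http(s) URL is publicly reachable.
    if ref.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": ref}}

    path = Path(ref)
    media_type = MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image extension: {path.suffix!r}")
    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dict can be placed directly in the `content` list of a user message, alongside a text block.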

PDF Document Analysis

PDFs can be sent as base64-encoded documents. Claude reads the full document and can answer questions, extract information, or summarize across pages:

pdf_data = base64.standard_b64encode(Path("report.pdf").read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                }
            },
            {
                "type": "text",
                "text": "Extract all financial figures mentioned and list them in a table with: metric name, value, and the page number where it appears."
            }
        ]
    }]
)
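
The document block can be built by a small helper that also checks the file looks like a PDF before encoding it. This is a sketch only: `pdf_block` is a hypothetical name, and the `%PDF-` header check is a cheap heuristic, not full validation.

```python
import base64
from pathlib import Path


def pdf_block(pdf_path: str) -> dict:
    """Build a base64 document content block from a local PDF file."""
    raw = Path(pdf_path).read_bytes()
    # Heuristic: PDF files carry a "%PDF-" header near the start of the file.
    if b"%PDF-" not in raw[:1024]:
        raise ValueError(f"{pdf_path} does not look like a PDF")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(raw).decode("utf-8"),
        },
    }
```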

Practical Use Cases

UI Screenshot Review

feedback = analyze_image(
    "screenshot.png",
    "Review this UI for usability issues. For each issue: identify the element, "
    "describe the problem, and suggest a fix. Format as a numbered list."
)

Chart and Graph Extraction

data = analyze_image(
    "quarterly-chart.png",
    "Extract the data from this bar chart. Return a JSON array where each entry "
    "has 'label' and 'value'. If you cannot read an exact value, provide an estimate "
    "and flag it with 'approximate': true."
)
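
Once the reply has been parsed into a list of entries, the `approximate` flag requested in the prompt can be used to separate values that need manual review. A minimal sketch, assuming the entries follow the shape the prompt asks for; `split_chart_points` is a hypothetical helper.

```python
def split_chart_points(points: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split chart entries into (exact, approximate) based on the 'approximate' flag."""
    exact = [p for p in points if not p.get("approximate", False)]
    approximate = [p for p in points if p.get("approximate", False)]
    return exact, approximate


# Illustrative data in the shape the prompt requests (not real model output).
sample = [
    {"label": "Q1", "value": 120},
    {"label": "Q2", "value": 135, "approximate": True},
]
exact, approximate = split_chart_points(sample)
```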

Receipt Processing

receipt = analyze_image(
    "receipt.jpg",
    "Extract: merchant name, date, line items (name + price), subtotal, tax, total. "
    "Return as JSON. If any field is not visible, set it to null."
)
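
Prompts that ask for JSON still return a string, and the reply may wrap the JSON in a Markdown code fence or add a sentence around it. The sketch below shows one tolerant way to parse it. `parse_json_reply` is a hypothetical helper, and the single-fence handling is an assumption about typical replies.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text: str):
    """Parse JSON from a model reply, tolerating one surrounding Markdown code fence."""
    match = FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else text
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(candidate.strip())


receipt_fields = parse_json_reply('{"merchant": "Cafe", "total": 12.5, "tax": null}')
```

Fields the prompt asked to be set to null arrive as `None` in Python, so downstream code should check for that before doing arithmetic.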

Code Screenshot Analysis

code_review = analyze_image(
    "code-screenshot.png",
    "What language is this? Identify any bugs or issues you can see. "
    "Transcribe the code as text."
)

Multiple Images in One Request

You can include multiple images in a single message. Claude reasons across all of them:

messages=[{
    "role": "user",
    "content": [
        {"type": "image", "source": {"type": "url", "url": before_url}},
        {"type": "image", "source": {"type": "url", "url": after_url}},
        {"type": "text", "text": "What changed between the before (first image) and after (second image)?"}
    ]
}]
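
With more than two images, a short text label before each one makes it easier to refer to them in the question. The sketch below builds such a content list. `labeled_image_content` is a hypothetical helper, and the "Image N:" label format is an illustrative choice.

```python
def labeled_image_content(urls: list[str], question: str) -> list[dict]:
    """Interleave 'Image N:' text labels with URL image blocks, then append the question."""
    content = []
    for index, url in enumerate(urls, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({"type": "image", "source": {"type": "url", "url": url}})
    content.append({"type": "text", "text": question})
    return content


# Placeholder URLs for illustration only.
content = labeled_image_content(
    ["https://example.com/before.png", "https://example.com/after.png"],
    "What changed between Image 1 and Image 2?",
)
```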

Writing Effective Vision Prompts

Be as specific with vision prompts as you are with text prompts. Claude will answer what you ask — not what you hoped for.

# Weak
"What do you see in this image?"

# Strong
"This is a screenshot of a React component. Identify:
1. Any accessibility issues (missing alt text, poor contrast, no focus indicators)
2. Any layout issues visible at this viewport size
3. Any text that appears to be truncated or overflowing
List each issue separately."
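
When the same kind of review runs across many images, the structured prompt can be assembled from a context sentence and a checklist. A minimal sketch that reproduces the "Strong" pattern above; `build_vision_prompt` is a hypothetical helper.

```python
def build_vision_prompt(context: str, checks: list[str]) -> str:
    """Build a numbered, specific vision prompt from a context sentence and a checklist."""
    lines = [f"{context} Identify:"]
    lines += [f"{number}. {check}" for number, check in enumerate(checks, start=1)]
    lines.append("List each issue separately.")
    return "\n".join(lines)


prompt = build_vision_prompt(
    "This is a screenshot of a React component.",
    [
        "Any accessibility issues (missing alt text, poor contrast, no focus indicators)",
        "Any layout issues visible at this viewport size",
        "Any text that appears to be truncated or overflowing",
    ],
)
```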

Limitations

No image generation: Claude can analyze images but cannot create them
No video: video files are not supported; extract still frames and send them as images
Accuracy varies with quality: blurry, low-contrast, or heavily compressed images produce less reliable results
Handwriting is harder: printed text is read well; handwriting is less reliable
Token costs for images: each image consumes tokens, and larger images cost more

Image Token Costs

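Images are billed based on their pixel dimensions. As a rule of thumb, tokens ≈ (width × height) / 750:

Image Size	Approximate Tokens
200×200 px	~54 tokens
1000×1000 px	~1,334 tokens
2000×2000 px	~5,334 tokens (before any downscaling)

The API may downscale very large images before processing, so the billed count for the largest sizes can be lower; check the current documentation for the exact resizing rules.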

For batch processing many images, consider resizing to the minimum resolution that preserves the information you need.
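
To budget a batch job, token counts can be estimated from pixel dimensions before anything is sent. The sketch below assumes the rule of thumb tokens ≈ (width × height) / 750, which matches the 1000×1000 row in the table. Both function names are hypothetical, and actual billing may differ, for example if the API resizes the image.

```python
import math


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate. Assumption: tokens ~= width * height / 750."""
    return math.ceil(width * height / 750)


def fit_within(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Scale dimensions down (never up) so the longer edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(estimate_image_tokens(1000, 1000))  # 1334
print(fit_within(2000, 1000, 1000))       # (1000, 500)
```

The target size returned by `fit_within` can be passed to whatever image library the pipeline already uses for resizing.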

Key Takeaways

Claude supports JPEG, PNG, GIF, WebP for images and PDF for documents
Send images as base64 (for local files) or URL (for public images)
PDFs can be analyzed page-by-page or as a whole document
Multiple images can be included in a single message for comparison tasks
Write specific, structured vision prompts — vagueness produces vague visual analysis
Images cost tokens proportional to their resolution; resize for batch cost control
Claude cannot generate images or process video — analysis only