
Lesson 26: Vision & Document Analysis

What Claude Can See

Claude can read, analyze, and reason about images and documents alongside text. This goes beyond simple description — Claude draws conclusions, extracts structured data, compares visuals, and answers specific questions about what it sees.

This unlocks a category of tasks that pure text models cannot handle: reviewing a UI screenshot, extracting numbers from a chart, reading a scanned receipt, implementing a design from a mockup, or debugging from an error screenshot.


Supported Image Formats

| Format | Best For |
|--------|----------|
| JPEG | Photos, screenshots |
| PNG | Diagrams, UI, screenshots with text |
| GIF | Static frames only — animated GIFs use the first frame |
| WebP | Good compression, broadly supported |

Size Limits and Constraints

| Constraint | Limit |
|------------|-------|
| Max file size (API) | 5MB per image |
| Max file size (claude.ai) | 10MB per image |
| Max dimensions | 8000 x 8000 px |
| Max dimensions (>20 images per request) | 2000 x 2000 px |
| Max images per API request | 100 |
| Max images per claude.ai turn | 20 |
| Overall API request size | 32MB |

Images larger than 1568 px on their longest edge are automatically scaled down (preserving aspect ratio) before processing. This scaling adds latency without improving results, so resize before sending when possible.

Very small images (under 200 px on any edge) may degrade performance.
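The downscaling rule can be sketched as a small helper that computes the dimensions an oversized image would be scaled to, so you can resize to those dimensions yourself before sending. This is a sketch of the documented rule only; the server's exact resampling behavior is an implementation detail:

```python
MAX_LONG_EDGE = 1568  # long-edge limit before automatic downscaling

def target_dimensions(width: int, height: int) -> tuple[int, int]:
    """Return the dimensions an image is scaled to before processing.

    Images whose longest edge exceeds 1568 px are scaled down
    proportionally; smaller images pass through unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= MAX_LONG_EDGE:
        return width, height
    scale = MAX_LONG_EDGE / long_edge
    return round(width * scale), round(height * scale)

print(target_dimensions(3024, 4032))  # portrait phone photo -> (1176, 1568)
print(target_dimensions(1092, 1092))  # already under the limit -> (1092, 1092)
```

Resizing to these dimensions locally (with any image library) avoids uploading pixels that would be thrown away anyway.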


Using Images in Claude Code

Claude Code supports images directly in the terminal. There are three ways to include them:

Drag and drop — Drag an image file into the Claude Code terminal window.

Clipboard paste — Copy an image (or take a screenshot), then paste with Ctrl+V. Note: use Ctrl+V, not Cmd+V, even on macOS.

File path — Reference an image path in your message:

Analyze this image: /path/to/screenshot.png

Claude Code reads the image inline and you can ask questions about it, request code based on it, or use it as context for debugging.

Tip: When Claude references images in its response (e.g., [Image #1]), you can Cmd+Click (Mac) or Ctrl+Click (Windows/Linux) the link to open the image in your default viewer.


How Image Tokens Are Calculated

Every image counts toward your token usage. The formula for images that do not need resizing:

tokens = (width × height) / 750

If an image's long edge exceeds 1568 px (or exceeds ~1.15 megapixels), it is scaled down first, and the token count is based on the scaled dimensions.

Here are examples at common sizes:

| Image Size | Approx. Tokens |
|------------|----------------|
| 200 x 200 px (0.04 MP) | ~54 |
| 1000 x 1000 px (1 MP) | ~1,334 |
| 1092 x 1092 px (1.19 MP) | ~1,590 |
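The formula and the scaling rule combine into a simple estimator. This is a sketch; the exact server-side rounding is not documented, so treat results as approximations:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost: (width * height) / 750 after any downscaling."""
    long_edge = max(width, height)
    if long_edge > 1568:  # long edges over 1568 px are scaled down first
        scale = 1568 / long_edge
        width, height = round(width * scale), round(height * scale)
    return math.ceil(width * height / 750)

print(estimate_image_tokens(200, 200))    # ~54
print(estimate_image_tokens(1000, 1000))  # ~1334
print(estimate_image_tokens(1092, 1092))  # ~1590
```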

The cost per image depends on the model. Smaller, cheaper models make batch image processing significantly more affordable — Haiku is roughly 60x cheaper per token than Opus.

For batch processing, resize images to the minimum resolution that preserves the information you need. A 1092 x 1092 image and a 2000 x 2000 image produce the same results (the larger one gets scaled down), but the larger one adds latency.
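To see how model choice changes batch economics, multiply the per-image token estimate by the model's input price. The per-million-token prices below are placeholders for illustration, not current pricing — check Anthropic's pricing page for real figures:

```python
# Hypothetical $/million-input-tokens figures -- substitute current pricing.
PRICE_PER_MTOK = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def batch_image_cost(num_images: int, tokens_per_image: int, model: str) -> float:
    """Input-token cost in dollars for a batch of same-sized images."""
    total_tokens = num_images * tokens_per_image
    return total_tokens / 1_000_000 * PRICE_PER_MTOK[model]

# 10,000 images at ~1,590 tokens each (1092 x 1092 px):
for model in PRICE_PER_MTOK:
    print(f"{model}: ${batch_image_cost(10_000, 1590, model):.2f}")
```

With these placeholder prices, the same 10,000-image batch costs a few dollars on the smallest model and hundreds on the largest, which is why model selection matters for bulk vision work.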


PDF Document Support

Claude can read PDF documents page by page, analyzing both text and visual content (charts, diagrams, tables, images embedded in pages).

PDF Limits

| Constraint | Limit |
|------------|-------|
| Max pages per request | 100 |
| Max request size | 32MB (entire payload) |
| Format | Standard PDF (no passwords or encryption) |

How PDF Processing Works

  1. Each page is converted into an image
  2. Text is extracted from each page and provided alongside the image
  3. Claude analyzes both text and visual content together

PDF Token Costs

Each page incurs two types of token costs:

  • Text tokens: 1,500–3,000 tokens per page depending on content density
  • Image tokens: Standard image-based cost per page (same formula as above)
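Putting the two costs together gives a rough per-document estimate. This is a sketch: text density varies widely, so it returns a (low, high) range, and the default page dimensions are an assumption approximating an A4 page rendered under the 1568 px limit:

```python
import math

def estimate_pdf_tokens(num_pages: int, page_width_px: int = 1092,
                        page_height_px: int = 1414) -> tuple[int, int]:
    """Rough (low, high) total token estimate for a PDF.

    Assumes each page is rendered at the given pixel dimensions and
    costs 1,500-3,000 text tokens plus (w * h) / 750 image tokens.
    """
    image_tokens = math.ceil(page_width_px * page_height_px / 750)
    low = num_pages * (1_500 + image_tokens)
    high = num_pages * (3_000 + image_tokens)
    return low, high

print(estimate_pdf_tokens(20))  # 20-page report -> (71180, 101180)
```

A 20-page report therefore lands in the range of roughly 70K–100K tokens, which is worth knowing before you send it against a context window or a budget.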

Sending a PDF via the API

PDFs use the document content type (not image):

Python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

pdf_data = base64.standard_b64encode(
    Path("report.pdf").read_bytes()
).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                }
            },
            {
                "type": "text",
                "text": "Extract all financial figures and list them in a table."
            }
        ]
    }]
)

PDFs can also be sent via URL or the Files API — see the API documentation for details.
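For a hosted PDF, the base64 source is swapped for a URL source while the rest of the request stays the same. A minimal sketch of the content blocks (the URL is a placeholder; confirm the current source types in the API documentation):

```python
# Content block for a PDF fetched by URL instead of base64 (placeholder URL).
def pdf_url_block(url: str) -> dict:
    return {
        "type": "document",
        "source": {"type": "url", "url": url},
    }

content = [
    pdf_url_block("https://example.com/report.pdf"),
    {"type": "text", "text": "Summarize the key findings."},
]
```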


Sending Images via the API

Base64-Encoded Images

For local images or images from non-public URLs, encode as base64:

Python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_image(image_path: str, prompt: str) -> str:
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")

    # Detect media type from extension
    ext = Path(image_path).suffix.lower()
    media_types = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif",
        ".webp": "image/webp"
    }
    media_type = media_types.get(ext, "image/png")

    response = client.messages.create(
        model="claude-sonnet-4-6",  # Check docs.anthropic.com for latest model IDs
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }]
    )
    return response.content[0].text

URL-Based Images

For publicly accessible images, pass the URL directly — no encoding needed:

Python
response = client.messages.create(
    model="claude-sonnet-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/architecture-diagram.png"
                }
            },
            {
                "type": "text",
                "text": "Identify any single points of failure in this architecture."
            }
        ]
    }]
)

Multiple Images in One Request

Include multiple images for comparison or multi-image reasoning. Label them for clarity:

Python
messages=[{
    "role": "user",
    "content": [
        {"type": "text", "text": "Image 1:"},
        {"type": "image", "source": {"type": "url", "url": before_url}},
        {"type": "text", "text": "Image 2:"},
        {"type": "image", "source": {"type": "url", "url": after_url}},
        {"type": "text", "text": "What changed between these two screenshots?"}
    ]
}]

Practical Use Cases

Debugging from Screenshots

Share an error screenshot in Claude Code and ask for diagnosis:

Here's a screenshot of the error: /path/to/error.png
What's causing this and how do I fix it?

Or via the API:

Python
feedback = analyze_image(
    "error-screenshot.png",
    "This is a browser console screenshot. Identify the error, explain the root cause, "
    "and suggest a fix. Include the specific file and line if visible."
)

Implementing Designs from Mockups

Python
code = analyze_image(
    "design-mockup.png",
    "Generate the HTML and CSS to implement this UI design. Use flexbox for layout. "
    "Match the colors, spacing, and typography as closely as possible."
)

Reviewing UI for Issues

Python
review = analyze_image(
    "screenshot.png",
    "Review this UI for usability issues. For each issue:\n"
    "1. Identify the element\n"
    "2. Describe the problem\n"
    "3. Suggest a fix\n"
    "Check for: accessibility, contrast, truncation, alignment, responsive layout."
)

Extracting Data from Charts

Python
data = analyze_image(
    "quarterly-chart.png",
    "Extract the data from this bar chart. Return a JSON array where each entry "
    "has 'label' and 'value'. If you cannot read an exact value, estimate it "
    "and flag it with 'approximate': true."
)

Reading Architecture Diagrams

Python
analysis = analyze_image(
    "system-diagram.png",
    "Describe this system architecture. List all components, their connections, "
    "and the data flow. Identify any potential bottlenecks or failure points."
)

Writing Effective Vision Prompts

Be as specific with vision prompts as you are with text prompts. Vague prompts produce vague analysis.

# Weak
"What do you see in this image?"

# Strong
"This is a screenshot of a React component. Identify:
1. Any accessibility issues (missing alt text, poor contrast, no focus indicators)
2. Any layout issues visible at this viewport size
3. Any text that appears truncated or overflowing
List each issue separately with the affected element."

Image placement matters. Claude performs best when images come before the text that refers to them. Place images early in the message, followed by your questions or instructions.


Best Practices

Resize before sending. If your image is larger than 1568 px on the long edge, Claude scales it down anyway. Resize on your end to avoid wasted latency. Target ~1.15 megapixels (e.g., 1092 x 1092) for the sweet spot of quality vs. cost.

Use PNG for text-heavy images. JPEG compression can blur text. If the image contains code, error messages, or UI text that Claude needs to read, prefer PNG.

Use JPEG for photos. For natural images where exact text fidelity is not critical, JPEG offers much smaller file sizes with minimal quality loss.

Prefer text when available. If you have the actual error log, code file, or structured data, send it as text rather than a screenshot. Text is cheaper (fewer tokens), faster, and more reliably parsed. Use images when the visual context matters — layout, design, charts, spatial relationships.

Label multiple images. When sending multiple images, prefix each with Image 1:, Image 2:, etc. so Claude can reference them unambiguously.

Cache PDFs for repeated queries. If you're asking multiple questions about the same PDF, use prompt caching to avoid re-processing the document each time:

Python
{
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": pdf_data,
    },
    "cache_control": {"type": "ephemeral"}
}

Limitations

  • No image generation — Claude analyzes images but cannot create, edit, or manipulate them
  • No video — Video files are not supported; extract individual frames instead
  • No people identification — Claude cannot name or identify specific individuals in images
  • Spatial reasoning is limited — Claude may struggle with precise localization, exact positions, or reading analog clock faces
  • Counting is approximate — Large numbers of small objects may not be counted precisely
  • Handwriting is less reliable — Printed text is read well; handwriting accuracy varies with legibility
  • Quality affects accuracy — Blurry, low-contrast, rotated, or heavily compressed images produce less reliable results
  • No metadata access — Claude does not read EXIF data or other image metadata
  • AI-generated image detection — Claude cannot reliably determine whether an image is AI-generated

Key Takeaways

  • Claude supports JPEG, PNG, GIF, and WebP images with a 5MB per image limit on the API
  • Token cost formula: (width x height) / 750 — resize to ~1.15 megapixels to optimize cost and latency
  • In Claude Code, use drag-and-drop, clipboard paste (Ctrl+V), or file paths to share images
  • PDFs support up to 100 pages per request and are processed as both text and images per page
  • Send images via base64 (local files) or URL (public images) in the API
  • Place images before text in your messages for best results
  • Use specific, structured prompts — vague questions produce vague visual analysis
  • Prefer text over screenshots when the raw data is available; use images when visual context matters