Lesson 15: Prompt Caching: 90% Cost Reduction

What Is Prompt Caching?

Every time you send a message to the Claude API, every token in your prompt is processed. For long system prompts, large reference documents, or few-shot example sets that you send with every request, you're paying to process the same tokens over and over.

Prompt caching lets you mark stable sections of your prompt so that Claude only processes them once. On subsequent requests, those cached sections are reused at a fraction of the cost.

The numbers:

  • Cache write: ~25% more expensive than a normal input token (one-time cost to populate the cache)
  • Cache read: ~10% of the normal input token price

If you're sending a 10,000-token system prompt with every request, and most requests hit the cache, you're paying 1/10th the normal price for those tokens.
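
To see how quickly the one-time write premium pays for itself, here is a rough back-of-the-envelope sketch using the multipliers above. It assumes every request after the first is a cache hit, which real traffic won't always achieve:

# Rough break-even math using the multipliers above (illustrative only)
BASE_PRICE = 3.00 / 1_000_000        # $ per input token (Sonnet list price)
WRITE_PRICE = BASE_PRICE * 1.25      # cache write: ~25% premium
READ_PRICE = BASE_PRICE * 0.10       # cache read: ~10% of the base price

def prompt_cost(tokens: int, requests: int, cached: bool) -> float:
    if not cached:
        return tokens * requests * BASE_PRICE
    # One cache write on the first request, cheap reads for the rest
    return tokens * (WRITE_PRICE + (requests - 1) * READ_PRICE)

for n in (1, 2, 10, 100):
    print(n,
          round(prompt_cost(10_000, n, cached=False), 4),
          round(prompt_cost(10_000, n, cached=True), 4))
# Caching already wins from the second request: 1.25 + 0.10 < 2 x 1.00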


How It Works

When you include cache_control in your prompt, Claude stores a processed snapshot of that content for 5 minutes. Any request made within that window that includes the same content up to that cache point will be served from cache.

The cache is keyed on the exact content of the prompt up to the cache breakpoint. If that content changes — even by one token — it's a cache miss and the full content is re-processed.

Important: The default TTL for cache_control: {"type": "ephemeral"} is 5 minutes, and it is refreshed each time the cached content is read, so steady traffic keeps the cache warm. A longer-lived one-hour cache option is also available at a higher cache-write price.
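
A useful mental model (not the actual implementation) is that the cache is keyed on something like a hash of the model plus everything in the prompt up to the breakpoint, which is why a one-token edit or a model switch produces a miss:

import hashlib

# Mental model only: the real cache key is managed server-side by the API.
def cache_key(model: str, prefix: str) -> str:
    return hashlib.sha256(f"{model}\x00{prefix}".encode()).hexdigest()[:16]

prefix = "You are an expert code reviewer..."
print(cache_key("claude-sonnet-4-5", prefix))
print(cache_key("claude-sonnet-4-5", prefix + " now"))  # any change: new key, cache miss
print(cache_key("claude-haiku-4-5", prefix))            # different model: new key, cache miss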


The `cache_control` Parameter

Mark content for caching by adding cache_control to a content block:

import anthropic

client = anthropic.Anthropic()

# A large, stable system prompt
SYSTEM_PROMPT = """You are an expert code reviewer specializing in Python and 
security vulnerabilities. You have deep knowledge of OWASP Top 10, common 
injection patterns, and secure coding practices...
[... imagine 5000 more tokens here ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function for SQL injection risks: ..."}
    ]
)

# Check cache usage in response
print(response.usage.cache_creation_input_tokens)  # Tokens written to cache
print(response.usage.cache_read_input_tokens)       # Tokens read from cache

On the first call: cache_creation_input_tokens > 0, cache_read_input_tokens = 0 (cache miss — content stored).

On subsequent calls within 5 minutes: cache_creation_input_tokens = 0, cache_read_input_tokens > 0 (cache hit — cheap!).
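
To know whether caching is actually paying off, aggregate these usage fields across calls. A minimal sketch (CacheStats is a hypothetical helper, not part of the SDK), reusing the response from the example above:

class CacheStats:
    """Aggregate usage fields across calls to estimate the token-level cache hit rate."""

    def __init__(self):
        self.written = 0   # tokens billed at the cache-write rate
        self.read = 0      # tokens billed at the cache-read rate

    def record(self, usage):
        self.written += usage.cache_creation_input_tokens or 0
        self.read += usage.cache_read_input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.written + self.read
        return self.read / total if total else 0.0

stats = CacheStats()
stats.record(response.usage)
print(f"Token-level cache hit rate so far: {stats.hit_rate:.0%}")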


What to Cache

Not everything is worth caching. Cache content that is:

Stable across requests:

  • System prompts and persona definitions
  • Reference documentation (API specs, style guides)
  • Large few-shot example sets
  • Legal disclaimers or compliance text

Long enough to matter:

  • Minimum cacheable length is ~1,024 tokens (varies by model; a quick length check is sketched at the end of this section)
  • Caching a 200-token system prompt saves almost nothing
  • Caching a 10,000-token document saves a lot

Don't cache:

  • The user's message (changes every request)
  • Dynamic context like timestamps or session state
  • Very short content below the minimum threshold
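
Because content under the minimum length simply isn't cached, it can be worth measuring a candidate block before adding a breakpoint. Here is a sketch using the SDK's token-counting endpoint (worth_caching is a hypothetical helper; it reuses the client and SYSTEM_PROMPT from earlier, and the 1,024 threshold should be adjusted to your model):

def worth_caching(client, text: str, model: str = "claude-sonnet-4-5",
                  minimum: int = 1024) -> bool:
    """Return True if `text` is long enough to be worth a cache breakpoint."""
    count = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],  # counting it as a user turn is close enough here
    )
    return count.input_tokens >= minimum

block = {"type": "text", "text": SYSTEM_PROMPT}
if worth_caching(client, SYSTEM_PROMPT):
    block["cache_control"] = {"type": "ephemeral"}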

Cost Math: Real Example

Suppose you have:

  • System prompt: 8,000 tokens
  • Per-request user message: 200 tokens
  • Response: 500 tokens
  • Volume: 1,000 requests/day on Claude Sonnet

Without caching:

Input:  8,200 tokens × 1,000 × $3/M = $24.60/day
Output:   500 tokens × 1,000 × $15/M = $7.50/day
Total: $32.10/day  →  ~$963/month

With caching (assume 95% hit rate):

Cache write (5% of requests):
  50 × 8,000 tokens × $3.75/M = $1.50

Cache read (95% of requests):
  950 × 8,000 tokens × $0.30/M = $2.28

User message (all requests, not cached):
  1,000 × 200 tokens × $3/M = $0.60

Output (all requests):
  1,000 × 500 tokens × $15/M = $7.50

Total: $11.88/day  →  ~$356/month

Savings: ~63% on total cost, and ~84% on the cached system-prompt portion ($24.00/day down to $3.78/day). The closer the hit rate gets to 100%, the closer that figure gets to the full 90% read discount.
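
The same arithmetic as a small script you can adapt; the prices and traffic numbers are the assumptions used above, not universal constants:

# Reproduces the daily figures above; swap in your own traffic and pricing.
PRICE_IN, PRICE_OUT = 3.00, 15.00        # $ per million tokens (Sonnet)
PRICE_WRITE, PRICE_READ = 3.75, 0.30     # cache write / cache read, per million

requests, hit_rate = 1_000, 0.95
system_tokens, user_tokens, output_tokens = 8_000, 200, 500
M = 1_000_000

without = (requests * (system_tokens + user_tokens) * PRICE_IN
           + requests * output_tokens * PRICE_OUT) / M

misses, hits = requests * (1 - hit_rate), requests * hit_rate
with_cache = (misses * system_tokens * PRICE_WRITE
              + hits * system_tokens * PRICE_READ
              + requests * user_tokens * PRICE_IN
              + requests * output_tokens * PRICE_OUT) / M

print(f"without caching: ${without:.2f}/day   with caching: ${with_cache:.2f}/day")
# -> without caching: $32.10/day   with caching: $11.88/day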


Structuring Prompts for Maximum Cache Hits

The cache is keyed on the prefix of your prompt. Structure your prompts with stable content first, dynamic content last:

# GOOD: Stable content first, dynamic content last
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STABLE_DOCUMENT,  # Cached
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Given the above, answer this: {user_question}"  # Dynamic, not cached
            }
        ]
    }
]

# BAD: Dynamic content before stable content breaks cache
messages = [
    {
        "role": "user",
        "content": f"Answer this: {user_question}\n\n{LARGE_STABLE_DOCUMENT}"
        # The prefix starts with user_question, which changes every request,
        # so a cache breakpoint here could never hit
    }
]
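
One way to keep the good structure consistent across an application is a small builder that always puts the stable document first (build_messages is a hypothetical helper; it assumes the client and LARGE_STABLE_DOCUMENT from earlier):

def build_messages(user_question: str) -> list[dict]:
    """Assemble a request with the stable document first so the cached prefix never changes."""
    return [{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STABLE_DOCUMENT,  # identical on every request
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": f"Given the above, answer this: {user_question}",  # varies per request
            },
        ],
    }]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=build_messages("What does the document say about error handling?"),
)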

Multiple Cache Breakpoints

You can have up to 4 cache breakpoints in a single request. This lets you cache multiple stable sections independently:

system = [
    {
        "type": "text",
        "text": PERSONA_AND_RULES,      # Stable: persona
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": LARGE_REFERENCE_DOC,    # Stable: reference material
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": dynamic_session_context  # Dynamic: no cache_control
    }
]
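
Passing this list to the API works like any other system prompt; on the first call both stable blocks show up in cache_creation_input_tokens (the user message here is illustrative):

# First call: both stable blocks are written to the cache.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system,
    messages=[{"role": "user", "content": "Summarize the key rules."}],
)
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)

# If LARGE_REFERENCE_DOC changes later but PERSONA_AND_RULES does not, only the
# later breakpoint needs a re-write; the persona prefix can still be read from
# cache. That independence is the point of multiple breakpoints.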

Limitations

  • Minimum length: Content must exceed ~1,024 tokens to be cacheable
  • TTL: 5-minute default; content not requested within the window is evicted
  • Exact match: Any change to the cached portion is a cache miss
  • Model-specific: Cache is per-model; switching models invalidates the cache

Key Takeaways

  • Prompt caching cuts input token costs by up to 90% for stable, repeated content
  • Mark content for caching with cache_control: {"type": "ephemeral"}
  • Cache writes cost ~25% more; cache reads cost ~10% of normal input pricing
  • Put stable content first, dynamic content last — cache is prefix-keyed
  • The 5-minute TTL means caching works best for high-throughput applications
  • Track cache_read_input_tokens in your responses to measure hit rate