
Lesson 20: Prompt Caching: 90% Cost Reduction

What Is Prompt Caching?

Every time you send a message to the Claude API, every token in your prompt is processed. For long system prompts, large reference documents, or few-shot example sets that you send with every request, you're paying to process the same tokens over and over.

Prompt caching lets you mark stable sections of your prompt so that Claude only processes them once. On subsequent requests, those cached sections are reused at a fraction of the cost.

The numbers:

  • Cache write: ~25% more expensive than a normal input token (one-time cost to populate the cache)
  • Cache read: ~10% of the normal input token price

If you're sending a 10,000-token system prompt with every request, and most requests hit the cache, you're paying 1/10th the normal price for those tokens.


How It Works

When you include cache_control in your prompt, Claude stores a processed snapshot of that content for 5 minutes. Any request made within that window that includes the same content up to that cache point will be served from cache.

The cache is keyed on the exact content of the prompt up to the cache breakpoint. If that content changes — even by one token — it's a cache miss and the full content is re-processed.

Important: The default TTL is 5 minutes, and it is refreshed each time the cached content is read, so steady traffic keeps the cache warm. The standard cache_control: {"type": "ephemeral"} gives you this 5-minute window. For workloads with longer gaps between requests, an extended TTL option is available that can keep cached content alive longer -- check the API documentation for current options and tier availability.
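At the time of writing, the extended TTL is requested with a ttl field inside cache_control (e.g. "1h"), and may require a beta header depending on your tier. The helper below is a hedged sketch of building such a block, not a confirmed API surface -- verify the field name and accepted values against the current docs:

```python
# Sketch: building a system content block that requests a cache entry.
# The "ttl" field and the "1h" value reflect the extended-TTL option as
# documented at the time of writing; it may require a beta header and
# tier access, so check docs.anthropic.com before relying on it.
def cached_system_block(text: str, ttl: str = "5m") -> dict:
    """Wrap stable prompt text in a content block marked for caching."""
    block = {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},
    }
    if ttl != "5m":
        block["cache_control"]["ttl"] = ttl  # e.g. "1h" for the extended window
    return block

print(cached_system_block("You are an expert code reviewer...", ttl="1h"))
```

With the default "5m", this produces exactly the cache_control block shown in the next section; the ttl key is only added when a longer window is requested.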


The `cache_control` Parameter

Mark content for caching by adding cache_control to a content block:

Python
import anthropic

client = anthropic.Anthropic()

# A large, stable system prompt
SYSTEM_PROMPT = """You are an expert code reviewer specializing in Python and 
security vulnerabilities. You have deep knowledge of OWASP Top 10, common 
injection patterns, and secure coding practices...
[... imagine 5000 more tokens here ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function for SQL injection risks: ..."}
    ]
)

# Check cache usage in response
print(response.usage.cache_creation_input_tokens)  # Tokens written to cache
print(response.usage.cache_read_input_tokens)      # Tokens read from cache

On the first call: cache_creation_input_tokens > 0, cache_read_input_tokens = 0 (cache miss — content stored).

On subsequent calls within 5 minutes: cache_creation_input_tokens = 0, cache_read_input_tokens > 0 (cache hit — cheap!).


What to Cache

Not everything is worth caching. Cache content that is:

Stable across requests:

  • System prompts and persona definitions
  • Reference documentation (API specs, style guides)
  • Large few-shot example sets
  • Legal disclaimers or compliance text

Long enough to matter:

  • Minimum cacheable length is 1,024--2,048 tokens depending on the model
  • Caching a 200-token system prompt saves almost nothing
  • Caching a 10,000-token document saves a lot

Don't cache:

  • The user's message (changes every request)
  • Dynamic context like timestamps or session state
  • Very short content below the minimum threshold

Cost Math: Relative Example

Suppose you have:

  • System prompt: 8,000 tokens (stable, cached)
  • Per-request user message: 200 tokens (dynamic, not cached)
  • Response: 500 tokens
  • Volume: 1,000 requests/day

Without caching: You pay full input price for all 8,200 tokens on every request.

With caching (assume 95% hit rate):

  • 5% of requests pay the cache write cost (~1.25x normal input price) for the 8,000-token system prompt
  • 95% of requests pay the cache read cost (~0.1x normal input price) for the 8,000-token system prompt
  • All requests pay full price for the 200-token user message and all output tokens

Result: ~63% savings on total cost (assuming output tokens priced at roughly 5x input, as in typical published pricing), and ~84% savings on the cached input portion at this hit rate (approaching the full ~90% as the hit rate nears 100%). At high volume, this is substantial.

Note: Check anthropic.com/pricing for current per-token rates. The relative discounts (cache reads at ~10% of normal input cost, cache writes at ~125%) have been consistent across model generations.
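The arithmetic behind those percentages can be checked in a few lines. Everything here is in relative units where one normal input token costs 1.0; the 5x output-token multiplier is an assumption matching typical published input/output price ratios, not a quoted rate:

```python
# Worked version of the cost example above, in relative units where one
# normal input token costs 1.0. OUTPUT_MULT is an assumption based on
# typical published pricing -- check anthropic.com/pricing for real rates.
CACHED = 8_000       # stable system prompt tokens (cached)
DYNAMIC = 200        # per-request user message tokens (not cached)
OUTPUT = 500         # response tokens
OUTPUT_MULT = 5.0    # output tokens ~5x input price (assumption)
HIT_RATE = 0.95      # fraction of requests served from cache
WRITE_MULT = 1.25    # cache write: ~25% surcharge over normal input
READ_MULT = 0.10     # cache read: ~10% of normal input price

# Per-request cost without caching: full price for all input plus output.
without = CACHED + DYNAMIC + OUTPUT * OUTPUT_MULT

# With caching: the stable 8,000 tokens cost a blend of reads and writes.
cached_portion = CACHED * (HIT_RATE * READ_MULT + (1 - HIT_RATE) * WRITE_MULT)
with_cache = cached_portion + DYNAMIC + OUTPUT * OUTPUT_MULT

print(f"savings on cached portion: {1 - cached_portion / CACHED:.1%}")
print(f"total savings:             {1 - with_cache / without:.1%}")
```

Raising the hit rate or shrinking the output relative to the cached prompt pushes total savings closer to the ~90% ceiling on the cached portion itself.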


Structuring Prompts for Maximum Cache Hits

The cache is keyed on the prefix of your prompt. Structure your prompts with stable content first, dynamic content last:

Python
# GOOD: Stable content first, dynamic content last
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STABLE_DOCUMENT,  # Cached
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Given the above, answer this: {user_question}"  # Dynamic, not cached
            }
        ]
    }
]

# BAD: Dynamic content before stable content breaks cache
messages = [
    {
        "role": "user",
        "content": f"Answer this: {user_question}\n\n{LARGE_STABLE_DOCUMENT}"
        # Even if cache_control were added here, the prefix starts with
        # user_question, which changes every request -- so it never hits
    }
]

Multiple Cache Breakpoints

You can have up to 4 cache breakpoints in a single request. This lets you cache multiple stable sections independently:

Python
system = [
    {
        "type": "text",
        "text": PERSONA_AND_RULES,      # Stable: persona
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": LARGE_REFERENCE_DOC,    # Stable: reference material
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": dynamic_session_context  # Dynamic: no cache_control
    }
]

Limitations

  • Minimum length: Content must exceed a minimum token count to be cacheable -- typically 1,024 tokens for most models, but up to 2,048 tokens for some (check model-specific documentation)
  • TTL: 5-minute default; content not requested within the window is evicted. Extended TTL options may be available.
  • Breakpoint limit: Maximum 4 cache breakpoints per request
  • Exact match: Any change to the cached portion is a cache miss
  • Model-specific: Cache is per-model; switching models invalidates the cache
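A practical consequence of the minimum-length limit: attaching cache_control to short content wastes one of your four breakpoints for no savings. Below is a hedged sketch of a guard; the maybe_cache helper is hypothetical, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
# Hypothetical guard: only mark a block for caching when it is likely
# over the minimum cacheable length. MIN_CACHEABLE_TOKENS is
# model-dependent (1,024-2,048); 4 chars/token is a rough heuristic.
MIN_CACHEABLE_TOKENS = 1024

def maybe_cache(text: str) -> dict:
    """Return a text content block, cache-marked only if long enough."""
    block = {"type": "text", "text": text}
    approx_tokens = len(text) // 4  # crude estimate, not a real tokenizer
    if approx_tokens >= MIN_CACHEABLE_TOKENS:
        block["cache_control"] = {"type": "ephemeral"}
    return block

print("cache_control" in maybe_cache("short prompt"))  # False
print("cache_control" in maybe_cache("x" * 10_000))    # True
```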

Key Takeaways

  • Prompt caching cuts input token costs by up to 90% for stable, repeated content
  • Mark content for caching with cache_control: {"type": "ephemeral"}
  • Cache writes cost ~25% more; cache reads cost ~10% of normal input pricing
  • Put stable content first, dynamic content last — cache is prefix-keyed
  • The 5-minute TTL means caching works best for high-throughput applications
  • Track cache_read_input_tokens in your responses to measure hit rate
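To act on that last point, it helps to accumulate the usage counters across calls. A minimal sketch, assuming only the usage field names shown earlier in this lesson (the CacheStats class itself is hypothetical):

```python
# Minimal hit-rate tracker. The CacheStats class is hypothetical; the
# usage attribute names match the response fields shown in this lesson.
from types import SimpleNamespace

class CacheStats:
    def __init__(self):
        self.read_tokens = 0     # served from cache (cheap)
        self.written_tokens = 0  # written to cache (surcharged)

    def record(self, usage) -> None:
        """Accumulate one API response's cache usage counters."""
        self.read_tokens += usage.cache_read_input_tokens
        self.written_tokens += usage.cache_creation_input_tokens

    @property
    def hit_rate(self) -> float:
        cacheable = self.read_tokens + self.written_tokens
        return self.read_tokens / cacheable if cacheable else 0.0

# Simulated usage objects standing in for response.usage from two calls:
stats = CacheStats()
stats.record(SimpleNamespace(cache_read_input_tokens=0, cache_creation_input_tokens=8000))
stats.record(SimpleNamespace(cache_read_input_tokens=8000, cache_creation_input_tokens=0))
print(f"hit rate: {stats.hit_rate:.0%}")  # one write, one read
```

In production you would call stats.record(response.usage) after each request; a hit rate well below your expectation usually means dynamic content has crept in ahead of a cache breakpoint.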