Lesson 15: Prompt Caching: 90% Cost Reduction
What Is Prompt Caching?
Every time you send a message to the Claude API, every token in your prompt is processed. For long system prompts, large reference documents, or few-shot example sets that you send with every request, you're paying to process the same tokens over and over.
Prompt caching lets you mark stable sections of your prompt so that Claude only processes them once. On subsequent requests, those cached sections are reused at a fraction of the cost.
The numbers:
- Cache write: ~25% more expensive than a normal input token (one-time cost to populate the cache)
- Cache read: ~10% of the normal input token price
If you're sending a 10,000-token system prompt with every request, and most requests hit the cache, you're paying 1/10th the normal price for those tokens.
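A quick way to see what those rates mean in practice is to fold the write premium and the read discount into a single effective multiplier on the base input price for the cached tokens. A back-of-envelope sketch (the 1.25x and 0.10x factors are the rates quoted above):

```python
def effective_input_multiplier(hit_rate: float) -> float:
    """Average cost multiplier for cached tokens, relative to normal input pricing.

    Cache misses pay the ~25% write premium (1.25x); cache hits pay ~10% (0.10x).
    """
    return 1.25 * (1 - hit_rate) + 0.10 * hit_rate

print(round(effective_input_multiplier(0.95), 4))  # 0.1575 -> cached tokens cost ~16% of normal
print(round(effective_input_multiplier(0.50), 4))  # 0.675  -> still a ~33% saving at a 50% hit rate
```

The break-even point is low: even a modest hit rate more than pays back the write premium on a large, stable prompt.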
How It Works
When you include cache_control in your prompt, Claude stores a processed snapshot of that content for 5 minutes. Any request made within that window that includes the same content up to that cache point will be served from cache.
The cache is keyed on the exact content of the prompt up to the cache breakpoint. If that content changes — even by one token — it's a cache miss and the full content is re-processed.
Important: The TTL is 5 minutes by default. `cache_control: {"type": "ephemeral"}` gives you this standard 5-minute window; longer-lived caching options may be available depending on your plan and model.
The `cache_control` Parameter
Mark content for caching by adding cache_control to a content block:
```python
import anthropic

client = anthropic.Anthropic()

# A large, stable system prompt
SYSTEM_PROMPT = """You are an expert code reviewer specializing in Python and
security vulnerabilities. You have deep knowledge of OWASP Top 10, common
injection patterns, and secure coding practices...

[... imagine 5000 more tokens here ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function for SQL injection risks: ..."}
    ]
)

# Check cache usage in the response
print(response.usage.cache_creation_input_tokens)  # Tokens written to cache
print(response.usage.cache_read_input_tokens)      # Tokens read from cache
```
On the first call: cache_creation_input_tokens > 0, cache_read_input_tokens = 0 (cache miss — content stored).
On subsequent calls within 5 minutes: cache_creation_input_tokens = 0, cache_read_input_tokens > 0 (cache hit — cheap!).
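If you log these fields per request, you can spot regressions in your caching setup. A small helper along these lines (hypothetical, not part of the SDK) classifies each response using only the usage fields shown above:

```python
def classify_cache_usage(usage) -> str:
    """Label a response's cache behaviour from its usage block."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read and not written:
        return "hit"
    if written and not read:
        return "miss (cache populated)"
    if written and read:
        return "partial hit"  # an earlier prefix was read, a later breakpoint was rewritten
    return "no cacheable content"

print(classify_cache_usage(response.usage))
```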
What to Cache
Not everything is worth caching. Cache content that is:
Stable across requests:
- System prompts and persona definitions
- Reference documentation (API specs, style guides)
- Large few-shot example sets
- Legal disclaimers or compliance text
Long enough to matter:
- Minimum cacheable length is ~1,024 tokens (varies by model); a quick way to check is sketched after this list
- Caching a 200-token system prompt saves almost nothing
- Caching a 10,000-token document saves a lot
Don't cache:
- The user's message (changes every request)
- Dynamic context like timestamps or session state
- Very short content below the minimum threshold
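To apply the length rule above, you can count a block's tokens before deciding to cache it. A minimal sketch, assuming your SDK version exposes the token-counting endpoint as `client.messages.count_tokens`; the 1,024 threshold is an assumed default, and the real minimum varies by model:

```python
import anthropic

client = anthropic.Anthropic()

def worth_caching(text: str, model: str = "claude-sonnet-4-5", minimum: int = 1024) -> bool:
    """Return True if `text` clears the assumed minimum cacheable length."""
    count = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],  # count includes a few tokens of message framing
    )
    return count.input_tokens >= minimum

print(worth_caching(SYSTEM_PROMPT))  # True for the ~5,000-token prompt above
```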
Cost Math: Real Example
Suppose you have:
- System prompt: 8,000 tokens
- Per-request user message: 200 tokens
- Response: 500 tokens
- Volume: 1,000 requests/day on Claude Sonnet
Without caching:
- Input: 8,200 tokens × 1,000 requests × $3/M = $24.60/day
- Output: 500 tokens × 1,000 requests × $15/M = $7.50/day
- Total: $32.10/day → ~$963/month

With caching (assuming a 95% hit rate):
- Cache write (5% of requests): 50 × 8,000 tokens × $3.75/M = $1.50
- Cache read (95% of requests): 950 × 8,000 tokens × $0.30/M = $2.28
- User messages (all requests, not cached): 1,000 × 200 tokens × $3/M = $0.60
- Output (all requests): 1,000 × 500 tokens × $15/M = $7.50
- Total: $11.88/day → ~$356/month
Savings: ~63% on total cost, and ~84% on the cached input portion in this example (it approaches the full 90% discount as the hit rate nears 100%).
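The same arithmetic as a small function, so you can plug in your own volumes and hit rates (the default prices are the per-million-token figures assumed in this example, not authoritative pricing):

```python
def daily_cost(
    cached_tokens: int = 8_000,     # stable system prompt
    dynamic_tokens: int = 200,      # per-request user message
    output_tokens: int = 500,
    requests: int = 1_000,
    hit_rate: float = 0.95,
    input_price: float = 3.00,      # $ per million input tokens
    output_price: float = 15.00,    # $ per million output tokens
) -> float:
    """Estimated daily spend in dollars with prompt caching enabled."""
    m = 1_000_000
    writes = requests * (1 - hit_rate) * cached_tokens * (input_price * 1.25) / m
    reads = requests * hit_rate * cached_tokens * (input_price * 0.10) / m
    dynamic = requests * dynamic_tokens * input_price / m
    output = requests * output_tokens * output_price / m
    return writes + reads + dynamic + output

print(round(daily_cost(), 2))              # 11.88, matching the breakdown above
print(round(daily_cost(hit_rate=0.0), 2))  # 38.1, since every request pays the write premium
```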
Structuring Prompts for Maximum Cache Hits
The cache is keyed on the prefix of your prompt. Structure your prompts with stable content first, dynamic content last:
```python
# GOOD: Stable content first, dynamic content last
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_STABLE_DOCUMENT,  # Cached
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Given the above, answer this: {user_question}"  # Dynamic, not cached
            }
        ]
    }
]

# BAD: Dynamic content before stable content breaks the cache
messages = [
    {
        "role": "user",
        "content": f"Answer this: {user_question}\n\n{LARGE_STABLE_DOCUMENT}"
        # Cache never hits because user_question changes
    }
]
```
Multiple Cache Breakpoints
You can have up to 4 cache breakpoints in a single request. This lets you cache multiple stable sections independently:
```python
system = [
    {
        "type": "text",
        "text": PERSONA_AND_RULES,  # Stable: persona
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": LARGE_REFERENCE_DOC,  # Stable: reference material
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": dynamic_session_context  # Dynamic: no cache_control
    }
]
```
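Passing that list into a request works exactly like the single-block example earlier. One useful property, sketched below under the assumption that only the later block changed between requests: the prefix up to the earlier breakpoint should still be served from cache, so you would expect both cache reads and cache writes in the same response.

```python
# Assumes the `client` and the `system` list defined above.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system,
    messages=[{"role": "user", "content": "Summarize the key rules."}]
)

# If only LARGE_REFERENCE_DOC changed since the previous request, the persona
# prefix can still be read from cache while the reference block is re-cached:
print(response.usage.cache_read_input_tokens)      # persona prefix served from cache
print(response.usage.cache_creation_input_tokens)  # rewritten reference block stored
```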
Limitations
- Minimum length: Content must exceed ~1,024 tokens to be cacheable
- TTL: 5-minute default; each cache read refreshes the timer, and content not used within the window is evicted
- Exact match: Any change to the cached portion is a cache miss
- Model-specific: Cache is per-model; switching models invalidates the cache
Key Takeaways
- Prompt caching cuts input token costs by up to 90% for stable, repeated content
- Mark content for caching with `cache_control: {"type": "ephemeral"}`
- Cache writes cost ~25% more; cache reads cost ~10% of normal input pricing
- Put stable content first, dynamic content last — cache is prefix-keyed
- The 5-minute TTL means caching works best for high-throughput applications
- Track `cache_read_input_tokens` in your responses to measure your hit rate (a sketch follows below)
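As a starting point for that last takeaway, a hypothetical tracker that accumulates usage across responses and reports a token-weighted hit rate:

```python
class CacheStats:
    """Accumulates cache usage across responses to estimate the hit rate."""

    def __init__(self):
        self.read = 0       # tokens served from cache
        self.written = 0    # tokens written to cache
        self.uncached = 0   # regular, non-cached input tokens

    def record(self, usage) -> None:
        self.read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.written += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.uncached += usage.input_tokens

    @property
    def hit_rate(self) -> float:
        cacheable = self.read + self.written
        return self.read / cacheable if cacheable else 0.0

stats = CacheStats()
stats.record(response.usage)  # call after each request
print(f"cache hit rate so far: {stats.hit_rate:.0%}")
```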