# Exercise 05: Prompt Caching in Practice

## Goal
Restructure real prompts to maximize cache hit rates, calculate the cost savings, and implement a caching strategy for a multi-turn chat application.

## Background
Prompt caching saves up to 90% on input token costs for stable content. The key insight: stable content must come before dynamic content in the prompt. The cache is keyed on the exact prefix -- any change to cached content invalidates it.

---

## Part A: Restructure for Cache Hits

Below is a poorly structured prompt that achieves zero cache hits despite having a large stable section. Your task is to restructure it to maximize caching.

**Original (uncacheable structure):**

The function below sends dynamic content before the 8,000-token stable API docs, so every request is a cache miss.
Every call is a full cache miss because the dynamic prefix (username, timestamp) changes each request and invalidates the prefix match.

**Problems:**
1. Dynamic content appears before the stable 8,000-token document
2. Everything is in one string -- no cache_control markers are possible
3. Every request is a full cache miss regardless of whether the docs changed

**Your task:** Rewrite build_prompt to use a separate system parameter for the API docs, mark the stable content with cache_control, and put the dynamic message in the messages array.
The function should return a dict of kwargs ready for client.messages.create(**kwargs).

Starter code:

    import anthropic
    
    API_DOCS = "[... 8,000 tokens of API docs and standards ...]"
    
    def build_cacheable_prompt(user_question, username, timestamp):
        # Returns kwargs dict for client.messages.create(**kwargs)
        # TODO: implement - keys: model, max_tokens, system, messages
        pass
    
    result = build_cacheable_prompt("How do I authenticate?", "alice", "2026-02-22")
    assert "system" in result
    assert any(
        isinstance(b, dict) and b.get("cache_control") is not None
        for b in result["system"]
    ), "System content should have cache_control"
    print("Structure looks correct\!")

---

## Part B: Cost Calculation

You run a developer documentation assistant with these stats:

- System prompt (API docs + style guide): **12,000 tokens**
- Average user question: **150 tokens**
- Average response: **600 tokens**
- Requests per day: **2,000**
- Model: Claude Sonnet (\.00 / 1M input, .00 / 1M output)
- Cache read price: **\/usr/bin/bash.30 / 1M** tokens (10% of normal input price)
- Cache write price: **\.75 / 1M** tokens (125% of input, one-time per 5 min window)

**Calculate the following:**

1. **Daily cost without caching:**
   - Input: (12,000 + 150) x 2,000 x \.00/M = ?
   - Output: 600 x 2,000 x .00/M = ?
   - Total daily cost: ?

2. **Daily cost with caching (assume 95% cache hit rate):**
   - Cache writes (5% of 2,000 = 100 requests): 100 x 12,000 x \.75/M = ?
   - Cache reads (95% of 2,000 = 1,900 requests): 1,900 x 12,000 x \/usr/bin/bash.30/M = ?
   - Dynamic input (2,000 x 150 x \.00/M): ?
   - Output (2,000 x 600 x .00/M): ?
   - Total daily cost with caching: ?

3. **Monthly savings** (30 days): ?

4. **At what daily request volume does caching break even?**
   Cache writes cost 25% more than normal input tokens -- at very low volume you can spend more on writes than you save on reads.

Show your calculations:

    [your calculations here]

---

## Part C: Chat Application Caching Strategy

You are building a customer support chatbot with these per-conversation components:
- Static system prompt: 3,000 tokens (role, rules, output format)
- Product knowledge base: 15,000 tokens (changes weekly)
- Per-user context: 200 tokens (account info, injected at session start)
- Conversation history: grows from 0 to ~5,000 tokens over a session

Design a caching strategy. For each content block, decide:
- Should it be cached? Why or why not?
- Where should it appear in the prompt (system vs messages, and in what order)?
- What is the cache TTL concern (does it change faster or slower than the 5-minute TTL)?

Fill in this table:

| Content Block | Cache? | Position | Frequency of Change | Notes |
|---------------|--------|----------|---------------------|-------|
| System prompt (role + rules) | | | | |
| Product knowledge base | | | | |
| Per-user account context | | | | |
| Conversation history | | | | |
| Current user message | | | | |

Then implement a ChatSession class that uses this caching strategy.
The send_message method should:
1. Build the system list with cache_control on stable sections
2. Build messages from conversation_history plus the new user message
3. Call the API
4. Append the assistant response to conversation_history
5. Return the response text

The get_cache_stats method should extract cache_read_input_tokens, cache_creation_input_tokens, and input_tokens from the response usage object.

---

## Part D: Measuring Cache Effectiveness

After implementing caching, verify it is working. Write a measure_cache_hit_rate function that:
1. Accepts a list of prompts, a system_content string, and a client
2. Sends each prompt to Claude with cache_control on system_content
3. Tracks whether each request was a cache hit (cache_read_input_tokens > 0)
4. Returns a dict with: total_requests, cache_hits, cache_misses, hit_rate, total_cache_read_tokens, total_cache_write_tokens

Use model claude-haiku-4-5 and max_tokens=100 to keep costs low.
Add a 100ms sleep between requests to avoid rate limits.

Run the function with at least 10 prompts against the real API.
If your hit rate is below 80%, list the possible causes.

---

## Success Criteria

- [ ] Part A: build_cacheable_prompt implemented and passes the structure assertion
- [ ] Part B: All cost calculations completed with correct arithmetic
- [ ] Part C: Caching strategy table filled in with reasoning for each decision
- [ ] Part C: ChatSession.send_message implemented with correct cache structure
- [ ] Part D: measure_cache_hit_rate run against real API with hit rate > 80%

## Reflection Questions

- What surprised you about how quickly cost savings compound with caching?
- What is the risk of caching content that changes more frequently than you expect?
- How would you handle cache invalidation if the knowledge base is updated mid-day?
