Lesson 21: Cost Monitoring & Optimization
Using Claude effectively means knowing where your tokens go. This lesson covers how to track usage, analyze costs per task, and apply optimization strategies that can dramatically reduce your spending without sacrificing quality.
Token Counting Basics
Every API call consumes two types of tokens:
- Input tokens — your system prompt, conversation history, and user message
- Output tokens — Claude's response (these cost more, typically ~5x the input-token rate)
A rough rule of thumb: 1 token ≈ 4 characters of English text, or about ¾ of a word. A 2,000-word essay is roughly 2,700 tokens.
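That rule of thumb is easy to encode as a quick pre-flight estimator. The helper below is hypothetical (not part of the SDK), and it is only a heuristic; the API also exposes a token-counting endpoint when exact numbers matter before you send a request.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters of English per token.
    Heuristic only; real tokenizer counts vary with the content."""
    return max(1, round(len(text) / 4))

# ~2,000 words at ~5 characters per word (including trailing spaces)
print(estimate_tokens("word " * 2_000))  # → 2500 by this heuristic
```

The character-based and word-based rules give slightly different answers (2,500 vs ~2,700 for a 2,000-word essay); both are close enough for budgeting, not for billing.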
Cost hierarchy: Haiku is ~60x cheaper than Opus per token. Sonnet sits in between. Always check the pricing page for current rates — prices change as models evolve.
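To turn token counts into dollars, keep a small rate table. The numbers below are illustrative placeholders, not current pricing; always load real rates from the pricing page.

```python
# Placeholder per-million-token rates in USD. These are illustrative
# only; check the official pricing page for current numbers.
PRICES_USD_PER_MTOK = {
    "claude-haiku-4-20250514": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rates = PRICES_USD_PER_MTOK[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000
```

Keeping the table in one place means a price change is a one-line edit rather than a hunt through your codebase.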
Tracking Usage from API Responses
Every API response includes a usage object with exact token counts. This is your primary data source for cost tracking.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # Check docs.anthropic.com for latest model IDs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion in two sentences."}],
)

# Every response includes usage data
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Model: {response.model}")
```

If you're using prompt caching, the usage object also includes cache_creation_input_tokens and cache_read_input_tokens, which are billed at different rates.
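Because cache writes and cache reads are billed at different rates from regular input, a cost calculation should weight them separately. This sketch assumes the commonly documented multipliers (writes at ~1.25x base input, reads at ~0.1x); confirm both against the current pricing page before relying on them.

```python
def effective_input_cost_units(usage: dict,
                               write_mult: float = 1.25,
                               read_mult: float = 0.10) -> float:
    """Combine regular, cache-write, and cache-read input tokens into
    base-input-token equivalents. The multipliers are assumptions;
    verify them against the current pricing page."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0) * write_mult
            + usage.get("cache_read_input_tokens", 0) * read_mult)
```

Multiply the result by your base input rate to get a dollar figure that accounts for caching.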
Building a Cost Tracker
Wrap your API calls to automatically log usage. This example stores data in a simple list, but in production you'd write to a database or logging service.
```python
import anthropic
from collections import defaultdict
from datetime import datetime

client = anthropic.Anthropic()
usage_log = []

def tracked_call(model: str, messages: list, **kwargs) -> str:
    """Make an API call and log token usage."""
    response = client.messages.create(
        model=model,
        messages=messages,
        **kwargs,
    )
    usage_log.append({
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_read": getattr(response.usage, "cache_read_input_tokens", 0),
        "cache_write": getattr(response.usage, "cache_creation_input_tokens", 0),
    })
    return response.content[0].text

def print_usage_summary():
    """Print aggregate usage statistics."""
    by_model = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for entry in usage_log:
        m = by_model[entry["model"]]
        m["calls"] += 1
        m["input"] += entry["input_tokens"]
        m["output"] += entry["output_tokens"]
    print(f"\n{'Model':<35} {'Calls':>6} {'Input':>10} {'Output':>10}")
    print("-" * 65)
    for model, stats in by_model.items():
        print(f"{model:<35} {stats['calls']:>6} {stats['input']:>10,} {stats['output']:>10,}")
```

Cost-Per-Task Analysis
Knowing your total spend isn't enough — you need to know which tasks are expensive. Tag each call with a task category.
```python
from datetime import datetime

def tracked_call_with_task(model: str, messages: list, task: str, **kwargs) -> str:
    """Track usage broken down by task type."""
    response = client.messages.create(model=model, messages=messages, **kwargs)
    usage_log.append({
        "timestamp": datetime.now().isoformat(),  # lets date-based checks filter entries
        "task": task,
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    return response.content[0].text

# Usage
tracked_call_with_task("claude-haiku-4-20250514", msgs, task="classification", max_tokens=50)
tracked_call_with_task("claude-sonnet-4-20250514", msgs, task="summarization", max_tokens=2048)
```

This reveals optimization opportunities: if "classification" accounts for 60% of calls but runs on Sonnet, switching that task to Haiku could cut costs dramatically.
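A per-task rollup then mirrors the per-model summary shown earlier. This sketch operates on the same list-of-dicts log format; entries logged without a task tag fall into an "untagged" bucket.

```python
from collections import defaultdict

def summarize_by_task(usage_log: list) -> dict:
    """Aggregate call counts and token totals per task tag."""
    by_task = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for entry in usage_log:
        t = by_task[entry.get("task", "untagged")]
        t["calls"] += 1
        t["input"] += entry["input_tokens"]
        t["output"] += entry["output_tokens"]
    return dict(by_task)
```

Sorting the result by total tokens puts your biggest optimization target at the top of the list.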
Optimization Strategies
1. Trim Your Prompts
Every token in your system prompt is charged on every single call. Remove verbose instructions. Use concise formatting. A system prompt that goes from 800 tokens to 300 tokens saves 500 input tokens per request — at scale, this adds up fast.
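The arithmetic is worth spelling out; the call volume below is a hypothetical number, not a benchmark.

```python
tokens_saved_per_call = 800 - 300   # trimmed system prompt
calls_per_day = 100_000             # hypothetical volume
daily_savings = tokens_saved_per_call * calls_per_day
print(f"{daily_savings:,} input tokens saved per day")  # → 50,000,000
```

Fifty million input tokens a day, saved by editing one prompt once.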
2. Use Prompt Caching
Repeated system prompts and long context documents should use prompt caching. Cached reads cost ~90% less than fresh input tokens. See Lesson 15: Prompt Caching for details.
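The stable prefix is marked with a cache_control breakpoint on the system block. Treat the shape below as a sketch of the Messages API request (the policy text is a stand-in); Lesson 15 covers the details and constraints such as minimum cacheable length.

```python
# Mark a long, stable system prompt as cacheable (sketch; see Lesson 15).
# The system text here is a placeholder for your real, lengthy prefix.
request_kwargs = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. (long policy document goes here)",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
# response = client.messages.create(**request_kwargs)
```

Everything up to and including the breakpoint is cached; only the user message after it is billed as fresh input on cache hits.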
3. Batch When Possible
The Batch API processes requests asynchronously at a 50% discount. If your use case tolerates delays (data pipelines, nightly jobs), batch everything. See Lesson 29: Batch API.
4. Choose the Right Model
Don't default to the most powerful model. Start with Haiku, move to Sonnet if quality is insufficient. Only use Opus for tasks that genuinely require it. See Lesson 26: Multi-Model Pipelines.
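One way to operationalize "start cheap, escalate on failure" is a ladder wrapper. Both callbacks here are stand-ins: call_model would wrap your real API call, and is_good_enough would be a real validator (a schema check, a regex, or a grader).

```python
def call_with_escalation(prompt: str, is_good_enough, call_model) -> str:
    """Try models cheapest-first; escalate only when the output fails a check.

    call_model(model, prompt) -> str and is_good_enough(text) -> bool
    are caller-supplied stand-ins for real API and validation logic.
    """
    ladder = [
        "claude-haiku-4-20250514",
        "claude-sonnet-4-20250514",
        "claude-opus-4-20250514",
    ]
    result = ""
    for model in ladder:
        result = call_model(model, prompt)
        if is_good_enough(result):
            return result
    return result  # Best effort: return the strongest model's answer
```

The trade-off is latency (a failed cheap call delays the answer), so this fits batch and background work better than interactive paths.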
5. Limit Output Tokens
Set max_tokens to a reasonable ceiling for each task. A classification task doesn't need max_tokens=4096 — set it to 50.
6. Reduce Conversation Length
Long conversation histories mean ballooning input tokens. Summarize earlier turns, or use a sliding window that keeps only recent messages plus a summary.
Claude Code Usage Tracking
When working with Claude Code, you can monitor token usage directly:
```shell
# Set a token budget for a session
claude --token-budget 50000

# Check usage in a conversation:
# type /cost during a session to see current usage
```

Claude Code displays token usage after each interaction, showing both input and output token counts. Watch these numbers to understand which operations are expensive: file reads and large codebases consume significant input tokens.
Setting Budget Alerts
For production systems, implement spending checks in your cost tracker:
```python
from datetime import date

DAILY_BUDGET_TOKENS = 5_000_000  # Set your own threshold

def check_budget():
    """Warn if daily usage exceeds budget."""
    today = date.today().isoformat()
    today_tokens = sum(
        e["input_tokens"] + e["output_tokens"]
        for e in usage_log
        if e.get("timestamp", "").startswith(today)  # skip entries without timestamps
    )
    if today_tokens > DAILY_BUDGET_TOKENS:
        print(f"⚠ BUDGET ALERT: {today_tokens:,} tokens used today "
              f"(budget: {DAILY_BUDGET_TOKENS:,})")
        return False
    return True
```

In Anthropic's Console, you can also set spending limits at the organization level, which act as a hard stop if your code has a runaway loop.
Key Takeaways
- Every API response includes exact token counts — use them, don't guess
- Track usage per model and per task to find your biggest optimization opportunities
- Prompt trimming, caching, batching, and model selection are the four biggest cost levers
- Set `max_tokens` appropriately for each task; don't leave it at the maximum
- Implement budget alerts to catch runaway costs before they become a problem
- Claude Code shows usage per interaction; use `/cost` to monitor your session