Lesson 21: Cost Monitoring & Optimization
Using Claude effectively means knowing where your tokens go. This lesson covers how to track usage, analyze costs per task, and apply optimization strategies that can dramatically reduce your spending without sacrificing quality.
Token Counting Basics
Every API call consumes two types of tokens:
- Input tokens — your system prompt, conversation history, and user message
- Output tokens — Claude's response (these cost more, typically ~5x the input-token rate)
A rough rule of thumb: 1 token ≈ 4 characters of English text, or about ¾ of a word. A 2,000-word essay is roughly 2,700 tokens.
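That rule of thumb is easy to encode as a quick pre-flight estimator. The helper below is hypothetical (not part of the SDK), and it is only a heuristic; the API also exposes a token-counting endpoint when exact numbers matter before you send a request.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters of English per token.
    Heuristic only; real tokenizer counts vary with the content."""
    return max(1, round(len(text) / 4))

# ~2,000 words at ~5 characters per word (including trailing spaces)
print(estimate_tokens("word " * 2_000))  # → 2500 by this heuristic
```

The character-based and word-based rules give slightly different answers (2,500 vs ~2,700 for a 2,000-word essay); both are close enough for budgeting, not for billing.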
Cost hierarchy: Haiku is ~60x cheaper than Opus per token. Sonnet sits in between. Always check the pricing page for current rates — prices change as models evolve.
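To turn token counts into dollars, keep a small rate table. The numbers below are illustrative placeholders, not current pricing; always load real rates from the pricing page.

```python
# Placeholder per-million-token rates in USD. These are illustrative
# only; check the official pricing page for current numbers.
PRICES_USD_PER_MTOK = {
    "claude-haiku-4-20250514": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rates = PRICES_USD_PER_MTOK[model]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000
```

Keeping the table in one place means a price change is a one-line edit rather than a hunt through your codebase.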
Tracking Usage from API Responses
Every API response includes a usage object with exact token counts. This is your primary data source for cost tracking.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # Check docs.anthropic.com for latest model IDs
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion in two sentences."}],
)

# Every response includes usage data
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Model: {response.model}")
```

If you're using prompt caching, the usage object also includes cache_creation_input_tokens and cache_read_input_tokens, which are billed at different rates.
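Because cache writes and cache reads are billed at different rates from regular input, a cost calculation should weight them separately. This sketch assumes the commonly documented multipliers (writes at ~1.25x base input, reads at ~0.1x); confirm both against the current pricing page before relying on them.

```python
def effective_input_cost_units(usage: dict,
                               write_mult: float = 1.25,
                               read_mult: float = 0.10) -> float:
    """Combine regular, cache-write, and cache-read input tokens into
    base-input-token equivalents. The multipliers are assumptions;
    verify them against the current pricing page."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0) * write_mult
            + usage.get("cache_read_input_tokens", 0) * read_mult)
```

Multiply the result by your base input rate to get a dollar figure that accounts for caching.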
Building a Cost Tracker
Wrap your API calls to automatically log usage. This example stores data in a simple list, but in production you'd write to a database or logging service.
```python
import anthropic
from collections import defaultdict
from datetime import datetime

client = anthropic.Anthropic()
usage_log = []

def tracked_call(model: str, messages: list, **kwargs) -> str:
    """Make an API call and log token usage."""
    response = client.messages.create(
        model=model,
        messages=messages,
        **kwargs,
    )
    usage_log.append({
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cache_read": getattr(response.usage, "cache_read_input_tokens", 0),
        "cache_write": getattr(response.usage, "cache_creation_input_tokens", 0),
    })
    return response.content[0].text

def print_usage_summary():
    """Print aggregate usage statistics."""
    by_model = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for entry in usage_log:
        m = by_model[entry["model"]]
        m["calls"] += 1
        m["input"] += entry["input_tokens"]
        m["output"] += entry["output_tokens"]
    print(f"\n{'Model':<35} {'Calls':>6} {'Input':>10} {'Output':>10}")
    print("-" * 65)
    for model, stats in by_model.items():
        print(f"{model:<35} {stats['calls']:>6} {stats['input']:>10,} {stats['output']:>10,}")
```

Cost-Per-Task Analysis
Knowing your total spend isn't enough — you need to know which tasks are expensive. Tag each call with a task category.
```python
from datetime import datetime

def tracked_call_with_task(model: str, messages: list, task: str, **kwargs) -> str:
    """Track usage broken down by task type."""
    response = client.messages.create(model=model, messages=messages, **kwargs)
    usage_log.append({
        "timestamp": datetime.now().isoformat(),  # lets date-based checks filter entries
        "task": task,
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    return response.content[0].text

# Usage
tracked_call_with_task("claude-haiku-4-20250514", msgs, task="classification", max_tokens=50)
tracked_call_with_task("claude-sonnet-4-20250514", msgs, task="summarization", max_tokens=2048)
```

This reveals optimization opportunities: if "classification" accounts for 60% of calls but runs on Sonnet, switching that task to Haiku could cut costs dramatically.
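A per-task rollup then mirrors the per-model summary shown earlier. This sketch operates on the same list-of-dicts log format; entries logged without a task tag fall into an "untagged" bucket.

```python
from collections import defaultdict

def summarize_by_task(usage_log: list) -> dict:
    """Aggregate call counts and token totals per task tag."""
    by_task = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})
    for entry in usage_log:
        t = by_task[entry.get("task", "untagged")]
        t["calls"] += 1
        t["input"] += entry["input_tokens"]
        t["output"] += entry["output_tokens"]
    return dict(by_task)
```

Sorting the result by total tokens puts your biggest optimization target at the top of the list.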
Optimization Strategies
1. Trim Your Prompts
Every token in your system prompt is charged on every single call. Remove verbose instructions. Use concise formatting. A system prompt that goes from 800 tokens to 300 tokens saves 500 input tokens per request — at scale, this adds up fast.
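The arithmetic is worth spelling out; the call volume below is a hypothetical number, not a benchmark.

```python
tokens_saved_per_call = 800 - 300   # trimmed system prompt
calls_per_day = 100_000             # hypothetical volume
daily_savings = tokens_saved_per_call * calls_per_day
print(f"{daily_savings:,} input tokens saved per day")  # → 50,000,000
```

Fifty million input tokens a day, saved by editing one prompt once.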
2. Use Prompt Caching
Repeated system prompts and long context documents should use prompt caching. Cached reads cost ~90% less than fresh input tokens. See Lesson 15: Prompt Caching for details.
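The stable prefix is marked with a cache_control breakpoint on the system block. Treat the shape below as a sketch of the Messages API request (the policy text is a stand-in); Lesson 15 covers the details and constraints such as minimum cacheable length.

```python
# Mark a long, stable system prompt as cacheable (sketch; see Lesson 15).
# The system text here is a placeholder for your real, lengthy prefix.
request_kwargs = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. (long policy document goes here)",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
# response = client.messages.create(**request_kwargs)
```

Everything up to and including the breakpoint is cached; only the user message after it is billed as fresh input on cache hits.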
3. Batch When Possible
The Batch API processes requests asynchronously at a 50% discount. If your use case tolerates delays (data pipelines, nightly jobs), batch everything. See Lesson 29: Batch API.
4. Choose the Right Model
Don't default to the most powerful model. Start with Haiku, move to Sonnet if quality is insufficient. Only use Opus for tasks that genuinely require it. See Lesson 26: Multi-Model Pipelines.
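One way to operationalize "start cheap, escalate on failure" is a ladder wrapper. Both callbacks here are stand-ins: call_model would wrap your real API call, and is_good_enough would be a real validator (a schema check, a regex, or a grader).

```python
def call_with_escalation(prompt: str, is_good_enough, call_model) -> str:
    """Try models cheapest-first; escalate only when the output fails a check.

    call_model(model, prompt) -> str and is_good_enough(text) -> bool
    are caller-supplied stand-ins for real API and validation logic.
    """
    ladder = [
        "claude-haiku-4-20250514",
        "claude-sonnet-4-20250514",
        "claude-opus-4-20250514",
    ]
    result = ""
    for model in ladder:
        result = call_model(model, prompt)
        if is_good_enough(result):
            return result
    return result  # Best effort: return the strongest model's answer
```

The trade-off is latency (a failed cheap call delays the answer), so this fits batch and background work better than interactive paths.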
5. Limit Output Tokens
Set max_tokens to a reasonable ceiling for each task. A classification task doesn't need max_tokens=4096 — set it to 50.
6. Reduce Conversation Length
Long conversation histories mean ballooning input tokens. Summarize earlier turns, or use a sliding window that keeps only recent messages plus a summary.
Claude Code Usage Tracking
When working with Claude Code, you can monitor token usage directly:
```shell
# Set a token budget for a session
claude --token-budget 50000

# Check usage in a conversation:
# type /cost during a session to see current usage
```

Claude Code displays token usage after each interaction, showing both input and output token counts. Watch these numbers to understand which operations are expensive: file reads and large codebases consume significant input tokens.
Setting Budget Alerts
For production systems, implement spending checks in your cost tracker:
```python
from datetime import date

DAILY_BUDGET_TOKENS = 5_000_000  # Set your own threshold

def check_budget():
    """Warn if daily usage exceeds budget."""
    today = date.today().isoformat()
    today_tokens = sum(
        e["input_tokens"] + e["output_tokens"]
        for e in usage_log
        if e.get("timestamp", "").startswith(today)  # skip entries without timestamps
    )
    if today_tokens > DAILY_BUDGET_TOKENS:
        print(f"⚠ BUDGET ALERT: {today_tokens:,} tokens used today "
              f"(budget: {DAILY_BUDGET_TOKENS:,})")
        return False
    return True
```

In Anthropic's Console, you can also set spending limits at the organization level, which act as a hard stop if your code has a runaway loop.
Key Takeaways
- Every API response includes exact token counts — use them, don't guess
- Track usage per model and per task to find your biggest optimization opportunities
- Prompt trimming, caching, batching, and model selection are the four biggest cost levers
- Set `max_tokens` appropriately for each task; don't leave it at the maximum
- Implement budget alerts to catch runaway costs before they become a problem
- Claude Code shows usage per interaction; use `/cost` to monitor your session