Lesson 19: Context Management at Scale
The Context Window Problem
Every Claude model has a fixed context window — the total number of tokens it can "see" at once, including your system prompt, the full conversation history, any documents you've included, and the response being generated.
Current limits (check anthropic.com for the latest):
- Claude 3.x (Haiku, Sonnet, Opus): 200,000 tokens
- Roughly 150,000 words, or ~500 pages
200K sounds enormous, but it fills up faster than you'd expect:
- A mid-size codebase: 50K–500K tokens
- A long conversation: 10K–50K tokens of history
- A reference document: 5K–30K tokens
- Large files (SQL dumps, logs): 20K–200K tokens
When the context fills, you have two choices: truncate (lose information) or summarize (compress information). Neither is free.
What to Include vs. Exclude
Being intentional about context is more valuable than having more context. Irrelevant tokens degrade quality — they dilute the signal.
Include:
- The specific files Claude needs to accomplish the task
- Relevant error messages, stack traces, and test output
- Decisions and constraints that affect the current task
- Examples of the pattern you want Claude to follow
Exclude:
- Files Claude won't touch in this task
- Old, resolved conversation branches
- Verbose logs when only the error line matters
- Duplicate or redundant information
In Claude Code: Use /add to include specific files rather than letting Claude read everything. The model doesn't need to see your entire codebase to fix a bug in one file.
Strategy 1: Summarization
At regular intervals (or when context is getting full), have Claude summarize the conversation or relevant state, then replace the full history with the summary.
def summarize_and_compress(messages: list, client) -> list:
    """Compress conversation history into a summary."""
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=1000,
        messages=[
            *messages,
            {
                "role": "user",
                "content": "Summarize the key decisions, code changes, and open questions "
                           "from this conversation in bullet points. Be concise."
            }
        ]
    )
    summary = summary_response.content[0].text

    # Return compressed history: just the summary as context
    return [
        {
            "role": "user",
            "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing from here...]"
        }
    ]
Strategy 2: Sliding Windows
Keep only the N most recent messages in context. Simple, but loses early context.
def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent messages."""
    if len(messages) <= max_messages:
        return messages
    # Always keep the first message (often contains the original task)
    return [messages[0]] + messages[-(max_messages - 1):]
A more sophisticated version combines sliding window with summarization: summarize the dropped messages and keep the summary as a "memory" prefix.
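One way to sketch that hybrid, reusing summarize_and_compress from Strategy 1 (an illustrative sketch, not an SDK feature; it assumes plain-string message contents, and the alignment logic exists only to keep user/assistant turns alternating):

def window_with_memory(messages: list, client, max_messages: int = 20) -> list:
    """Sliding window that summarizes dropped messages instead of discarding them."""
    if len(messages) <= max_messages:
        return messages

    split = len(messages) - (max_messages - 1)
    # Align the split so the retained window starts on a "user" turn
    while split < len(messages) and messages[split]["role"] != "user":
        split += 1

    dropped, recent = messages[:split], messages[split:]
    if not recent:  # degenerate window: everything got summarized
        return summarize_and_compress(messages, client)

    # Strategy 1 helper returns a single user message containing the summary
    memory = summarize_and_compress(dropped, client)[0]["content"]

    # Fold the summary into the first retained user message so roles keep alternating
    first = dict(recent[0])
    first["content"] = f"{memory}\n\n{first['content']}"
    return [first] + recent[1:]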
Strategy 3: Chunking Large Documents
When you need to process a document larger than available context, chunk it and process each chunk:
def process_large_document(document: str, question: str, chunk_size: int = 50000) -> str:
    """Process a document too large for a single context window."""
    # Split into overlapping chunks to preserve cross-boundary context.
    # Note: chunk_size and overlap are measured in words, not tokens.
    words = document.split()
    overlap = 500  # words
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap

    # Process each chunk and collect answers
    partial_answers = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Document section {i+1} of {len(chunks)}:\n\n{chunk}\n\n"
                           f"Question: {question}\n"
                           f"Answer based only on this section. If not present, say 'not found in this section'."
            }]
        )
        partial_answers.append(response.content[0].text)

    # Synthesize partial answers into a single response
    synthesis = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"These are partial answers to '{question}' from different sections:\n\n"
                       + "\n\n".join(f"Section {i+1}: {a}" for i, a in enumerate(partial_answers))
                       + "\n\nSynthesize a final, complete answer."
        }]
    )
    return synthesis.content[0].text
RAG: Retrieval-Augmented Generation
For large knowledge bases, don't put everything in context — retrieve only what's relevant.
The RAG pattern:
- Index your documents (split into chunks, embed each chunk as a vector)
- At query time, embed the user's question
- Find the most similar document chunks (vector similarity search)
- Include only those chunks in the Claude prompt
# Pseudocode for RAG
def rag_answer(question: str, knowledge_base: VectorStore) -> str:
    # Retrieve top-5 most relevant chunks
    relevant_chunks = knowledge_base.search(question, top_k=5)
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Use this context to answer the question:\n\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text
Popular vector stores: Pinecone, Weaviate, pgvector (PostgreSQL), Chroma (local).
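The pseudocode above covers only the query side. A minimal in-memory sketch of the indexing and search steps is below; the hashed bag-of-words embed() is just a runnable stand-in for a real embedding model, and this toy search() returns plain chunk strings rather than objects.

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding (hashed bag-of-words). Use a real embedding model in practice."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

class InMemoryVectorStore:
    """Toy vector store: embed chunks at index time, cosine-similarity search at query time."""
    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def search(self, query: str, top_k: int = 5) -> list[str]:
        q = embed(query)
        # Cosine similarity between the query vector and every stored chunk vector
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        best = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        return [self.chunks[i] for i in best]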
Token Counting and Budgeting
Count tokens before sending to avoid context overflow errors:
# Count tokens without sending a message
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": your_prompt}]
)
print(f"Tokens: {token_count.input_tokens}")

if token_count.input_tokens > 150000:
    print("Warning: approaching context limit, consider summarizing")
Context for Code: Smart File Selection
In coding tasks, context quality matters more than quantity. The right 10 files beat 100 random files.
Selection heuristics (a rough automation sketch follows these lists):
- The file being modified (always)
- Direct imports of that file
- The test file for that module
- The type definitions / interfaces it uses
- One example of a similar pattern elsewhere in the codebase
Exclude:
- Configuration files (unless the task involves config)
- Generated files (migrations, lock files, build output)
- Unrelated modules
- Vendor/dependency code
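A rough sketch of automating the first three heuristics for a Python codebase follows; the import matching and test-file conventions here are naive assumptions for illustration, not a standard tool.

import re
from pathlib import Path

def select_context_files(target: Path, repo_root: Path) -> list[Path]:
    """Heuristic context selection: the file being modified, its local imports,
    and its test file (when they exist)."""
    selected = [target]
    source = target.read_text()

    # Direct imports that resolve to files in this repo (top-level modules only)
    for match in re.finditer(r"^\s*(?:from|import)\s+([\w\.]+)", source, re.MULTILINE):
        module_path = repo_root / (match.group(1).replace(".", "/") + ".py")
        if module_path.exists():
            selected.append(module_path)

    # Conventional test file locations
    for candidate in (target.parent / f"test_{target.name}",
                      repo_root / "tests" / f"test_{target.name}"):
        if candidate.exists():
            selected.append(candidate)

    return selected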
Key Takeaways
- 200K tokens sounds large but fills quickly with real codebases and long conversations
- Irrelevant context degrades quality — be intentional about what you include
- Use summarization to compress long conversation history; use cheap models (Haiku) for summarization
- Sliding windows preserve recent context; combine with summaries for best results
- Chunk oversized documents with overlap to preserve cross-boundary information
- RAG retrieves only relevant document sections instead of loading everything
- Count tokens proactively with count_tokens to avoid overflow errors
- For code tasks, select files by relevance to the task, not by proximity in the directory tree