Lesson 19 of 20

Lesson 19: Context Management at Scale

The Context Window Problem

Every Claude model has a fixed context window — the total number of tokens it can "see" at once, including your system prompt, the full conversation history, any documents you've included, and the response being generated.

Current limits (verify at anthropic.com for latest):

  • Recent Claude models (Haiku, Sonnet, and Opus families, Claude 3 onward): 200,000 tokens
  • Roughly 150,000 words, or ~500 pages

200K sounds enormous, but it fills up faster than you'd expect:

  • A mid-size codebase: 50K–500K tokens
  • A long conversation: 10K–50K tokens of history
  • A reference document: 5K–30K tokens
  • Large files (SQL dumps, logs): 20K–200K tokens
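For a quick sanity check before reaching for exact counting (covered later in this lesson), a common rule of thumb is roughly 4 characters per token of English text. The sketch below uses that approximation; the file name is just an example.

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return len(text) // 4

# Example: gauge a large file before deciding whether it fits in context
with open("schema.sql") as f:  # hypothetical file name
    print(f"~{estimate_tokens(f.read()):,} tokens")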

When the context fills, you have two choices: truncate (lose information) or summarize (compress information). Neither is free.


What to Include vs. Exclude

Being intentional about context is more valuable than having more context. Irrelevant tokens degrade quality — they dilute the signal.

Include:

  • The specific files Claude needs to accomplish the task
  • Relevant error messages, stack traces, and test output
  • Decisions and constraints that affect the current task
  • Examples of the pattern you want Claude to follow

Exclude:

  • Files Claude won't touch in this task
  • Old, resolved conversation branches
  • Verbose logs when only the error line matters
  • Duplicate or redundant information

In Claude Code: point Claude at the specific files it needs (for example, by @-mentioning them) rather than letting it read the whole project. The model doesn't need to see your entire codebase to fix a bug in one file.


Strategy 1: Summarization

At regular intervals (or when context is getting full), have Claude summarize the conversation or relevant state, then replace the full history with the summary.

def summarize_and_compress(messages: list, client) -> list:
    """Compress conversation history into a summary."""
    
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # Use cheap model for summarization
        max_tokens=1000,
        messages=[
            *messages,
            {
                "role": "user",
                "content": "Summarize the key decisions, code changes, and open questions "
                           "from this conversation in bullet points. Be concise."
            }
        ]
    )
    
    summary = summary_response.content[0].text
    
    # Return compressed history: just the summary as context
    return [
        {
            "role": "user",
            "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing from here...]"
        }
    ]
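A sketch of when to trigger compression in a long-running loop, using the count_tokens endpoint covered later in this lesson. The 150K threshold is an arbitrary safety margin, not a hard API limit.

COMPRESS_THRESHOLD = 150_000  # tokens; leaves headroom below the 200K limit

def maybe_compress(messages: list, client) -> list:
    """Compress history only when it approaches the context limit."""
    count = client.messages.count_tokens(
        model="claude-haiku-4-5",
        messages=messages,
    )
    if count.input_tokens > COMPRESS_THRESHOLD:
        return summarize_and_compress(messages, client)
    return messages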

Strategy 2: Sliding Windows

Keep only the N most recent messages in context. Simple, but loses early context.

def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent messages."""
    if len(messages) <= max_messages:
        return messages
    
    # Always keep the first message (often contains the original task)
    return [messages[0]] + messages[-(max_messages - 1):]

A more sophisticated version combines sliding window with summarization: summarize the dropped messages and keep the summary as a "memory" prefix.
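One way to combine the two, reusing summarize_and_compress from Strategy 1 (a sketch; the function name window_with_memory is illustrative):

def window_with_memory(messages: list, client, max_messages: int = 20) -> list:
    """Sliding window that turns dropped messages into a summary 'memory' prefix."""
    if len(messages) <= max_messages:
        return messages

    dropped = messages[:-(max_messages - 1)]
    recent = messages[-(max_messages - 1):]

    # Summarize only the dropped portion and keep it as a prefix.
    # Depending on how your history alternates, you may need to merge the
    # summary with the first retained message to keep user/assistant turns tidy.
    memory = summarize_and_compress(dropped, client)
    return memory + recent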


Strategy 3: Chunking Large Documents

When you need to process a document larger than available context, chunk it and process each chunk:

def process_large_document(document: str, question: str, client, chunk_size: int = 50000) -> str:
    """Process a document too large for a single context window.

    chunk_size and overlap are measured in words, not tokens.
    """

    # Split into overlapping chunks to preserve cross-boundary context
    words = document.split()
    overlap = 500  # words
    chunks = []
    
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap
    
    # Process each chunk and collect answers
    partial_answers = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Document section {i+1} of {len(chunks)}:\n\n{chunk}\n\n"
                           f"Question: {question}\n"
                           f"Answer based only on this section. If not present, say 'not found in this section'."
            }]
        )
        partial_answers.append(response.content[0].text)
    
    # Synthesize partial answers
    synthesis = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"These are partial answers to '{question}' from different sections:\n\n"
                       + "\n\n".join(f"Section {i+1}: {a}" for i, a in enumerate(partial_answers))
                       + "\n\nSynthesize a final, complete answer."
        }]
    )
    return synthesis.content[0].text

RAG: Retrieval-Augmented Generation

For large knowledge bases, don't put everything in context — retrieve only what's relevant.

The RAG pattern:

  1. Index your documents (split into chunks, embed each chunk as a vector)
  2. At query time, embed the user's question
  3. Find the most similar document chunks (vector similarity search)
  5. Include only those chunks in the Claude prompt

# Pseudocode for RAG (assumes an Anthropic client and a VectorStore with a .search() method)
def rag_answer(question: str, knowledge_base: VectorStore) -> str:
    # Retrieve top-5 most relevant chunks
    relevant_chunks = knowledge_base.search(question, top_k=5)
    
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Use this context to answer the question:\n\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

Popular vector stores: Pinecone, Weaviate, pgvector (PostgreSQL), Chroma (local).
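To make steps 1–3 concrete, here is a toy in-memory index using cosine similarity. embed() is a placeholder for whatever embedding provider you use, and unlike the pseudocode above this version returns plain strings rather than chunk objects; a production system would use one of the stores listed above.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding provider of choice here."""
    raise NotImplementedError

class InMemoryVectorStore:
    """Toy index: store one vector per chunk, rank by cosine similarity at query time."""

    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def search(self, query: str, top_k: int = 5) -> list[str]:
        q = embed(query)
        # Cosine similarity between the query and every stored chunk
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        return [self.chunks[i] for i in ranked[:top_k]]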


Token Counting and Budgeting

Count tokens before sending to avoid context overflow errors:

# Count tokens without sending a message
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": your_prompt}]
)

print(f"Tokens: {token_count.input_tokens}")

if token_count.input_tokens > 150000:
    print("Warning: approaching context limit, consider summarizing")

Context for Code: Smart File Selection

In coding tasks, context quality matters more than quantity. The right 10 files beat 100 random files.

Selection heuristics (a rough code sketch follows these lists):

  • The file being modified (always)
  • Direct imports of that file
  • The test file for that module
  • The type definitions / interfaces it uses
  • One example of a similar pattern elsewhere in the codebase

Exclude:

  • Configuration files (unless the task involves config)
  • Generated files (migrations, lock files, build output)
  • Unrelated modules
  • Vendor/dependency code
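A rough sketch of these heuristics for a Python repository. The regex-based import scan and the tests/test_<name>.py layout are simplifying assumptions; adapt them to your project's structure.

from pathlib import Path
import re

def select_context_files(target: Path, repo_root: Path) -> list[Path]:
    """Heuristic file selection for a Python repo (illustrative, not exhaustive)."""
    selected = [target]

    # Direct imports: naive scan for `import x` / `from x import y` in the target file
    source = target.read_text()
    for module in re.findall(r"^(?:from|import)\s+([\w.]+)", source, flags=re.MULTILINE):
        candidate = repo_root / (module.replace(".", "/") + ".py")
        if candidate.exists():  # only local modules resolve to a path in the repo
            selected.append(candidate)

    # The module's test file, assuming a tests/test_<name>.py convention
    test_file = repo_root / "tests" / f"test_{target.stem}.py"
    if test_file.exists():
        selected.append(test_file)

    return selected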

Key Takeaways

  • 200K tokens sounds large but fills quickly with real codebases and long conversations
  • Irrelevant context degrades quality — be intentional about what you include
  • Use summarization to compress long conversation history; use cheap models (Haiku) for summarization
  • Sliding windows preserve recent context; combine with summaries for best results
  • Chunk oversized documents with overlap to preserve cross-boundary information
  • RAG retrieves only relevant document sections instead of loading everything
  • Count tokens proactively with count_tokens to avoid overflow errors
  • For code tasks, select files by relevance to the task, not by proximity in the directory tree