# Lesson 23: Extended Thinking

## What Is Extended Thinking?
Extended thinking gives Claude the ability to reason through a problem step by step before producing a final answer. Instead of responding immediately, Claude generates internal "thinking" content -- a chain of reasoning that explores the problem, considers tradeoffs, and works through intermediate steps. The final response is informed by this reasoning, producing noticeably better results on complex tasks.
Think of it as the difference between asking someone to answer a hard question off the top of their head versus giving them time to work it through on a whiteboard first.
**Key point:** Extended thinking is enabled by default in Claude Code. You are already benefiting from it. This lesson explains how it works, how to control it, and how to use it effectively in both Claude Code and the API.
## How Extended Thinking Works in Claude Code
In Claude Code (the CLI), extended thinking is on by default. When you send a prompt, Claude thinks before responding -- you will see a "Thinking..." indicator while it reasons through the problem.
**Toggle thinking on or off:**

- Press `Alt+T` (Windows/Linux) or `Option+T` (macOS) to toggle extended thinking
- Use natural language: "think about this problem" or "think harder about this architecture"
**Configure thinking globally:**

```json
{
  "alwaysThinkingEnabled": true
}
```

Set this via `/config` in Claude Code.
**Control thinking depth with environment variables:**

```bash
# Set effort level (recommended for newer models)
CLAUDE_CODE_EFFORT_LEVEL=low|medium|high

# Or set a manual thinking token budget
MAX_THINKING_TOKENS=10000

# Disable thinking entirely
MAX_THINKING_TOKENS=0
```

**Note:** For newer models, the effort level is the recommended way to control thinking depth. The `MAX_THINKING_TOKENS` variable is ignored unless set to `0` (to disable thinking entirely).
## Adaptive Thinking
Adaptive thinking is the recommended thinking mode for newer models. Instead of you setting a fixed token budget for thinking, Claude dynamically decides when and how much to think based on the complexity of each request.
Why adaptive thinking matters:
- Simple questions ("What is the capital of France?") get answered quickly with minimal or no thinking
- Complex problems (debugging a race condition, designing system architecture) trigger deep reasoning automatically
- You do not need to guess the right budget -- Claude allocates what the problem requires
How to use it in the API:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[{
        "role": "user",
        "content": "Design a migration strategy for moving from a monolith to microservices."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Response: {block.text}")
```

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-6", // Check docs.anthropic.com for latest model IDs
  max_tokens: 16000,
  thinking: { type: "adaptive" },
  messages: [{
    role: "user",
    content: "Design a migration strategy for moving from a monolith to microservices."
  }]
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log(`Thinking: ${block.thinking}`);
  } else if (block.type === "text") {
    console.log(`Response: ${block.text}`);
  }
}
```

Adaptive thinking also automatically enables interleaved thinking -- Claude can think between tool calls, reasoning about intermediate results before deciding what to do next. This is especially effective for agentic workflows.
**Important:** Manual thinking mode (`type: "enabled"` with `budget_tokens`) is deprecated on newer models. Use adaptive thinking with the effort parameter instead. If you are already using `budget_tokens`, it continues to work but will be removed in a future release.
## The Effort Parameter
The effort parameter gives you a dial to control how deeply Claude thinks. It works with adaptive thinking and affects all tokens in the response -- not just thinking, but also text output and tool calls.
| Effort Level | Thinking Behavior | When to Use |
|---|---|---|
| `max` | Always thinks with no constraints on depth | Tasks requiring the deepest possible reasoning |
| `high` (default) | Always thinks; deep reasoning on complex tasks | Complex coding, architecture, debugging |
| `medium` | Moderate thinking; may skip for simple queries | Balanced speed and quality for most tasks |
| `low` | Minimal thinking; skips for simple tasks | Speed-sensitive or straightforward tasks |
```python
import anthropic

client = anthropic.Anthropic()

# Fast, lightweight response for a simple question
response = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=4096,
    thinking={"type": "adaptive"},
    output_config={"effort": "low"},
    messages=[{"role": "user", "content": "What does the zip() function do in Python?"}]
)

# Deep reasoning for a complex architecture decision
response = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "max"},
    messages=[{"role": "user", "content": "Analyze the race condition in this distributed lock implementation..."}]
)
```

The effort parameter also works without thinking enabled. It controls overall token spend, including text responses and tool calls. At lower effort levels, Claude makes fewer tool calls, uses less preamble, and produces more concise output.
**Tip:** You can also tune thinking behavior via your system prompt. If Claude is thinking more than you want, add guidance like: "Extended thinking adds latency and should only be used when it will meaningfully improve answer quality."
## Budget Tokens (Manual Mode)

For older models that do not support adaptive thinking, or when you need precise control over thinking token spend, you can set a manual thinking budget using `budget_tokens`.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Claude can use up to 10K tokens to think
    },
    messages=[{
        "role": "user",
        "content": "Prove that there are infinitely many prime numbers."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Answer: {block.text}")
```

How much budget to allocate:
| Problem Complexity | Suggested Budget |
|---|---|
| Simple multi-step math | 1,024 -- 2,000 |
| Moderate reasoning / debugging | 4,000 -- 8,000 |
| Complex architecture / proofs | 10,000 -- 16,000 |
| Research synthesis / hard algorithms | 20,000+ |
Claude will not always use the full budget -- it only thinks as much as the problem warrants. The minimum budget is 1,024 tokens, and `budget_tokens` must be less than `max_tokens`.
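As a sketch, those two constraints can be enforced before making the request. The helper below is illustrative, not part of the SDK; the one-token headroom is an assumption.

```python
# Illustrative helper (not part of the SDK): clamp a requested thinking
# budget to the documented constraints -- at least 1,024 tokens, and
# strictly less than max_tokens.
MIN_THINKING_BUDGET = 1024

def clamp_thinking_budget(requested: int, max_tokens: int) -> int:
    if max_tokens <= MIN_THINKING_BUDGET:
        raise ValueError("max_tokens leaves no room for the minimum thinking budget")
    budget = max(requested, MIN_THINKING_BUDGET)
    # Leave at least one token of headroom for the final answer.
    return min(budget, max_tokens - 1)
```

Pass the result as `budget_tokens` in the `thinking` parameter shown above.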
Which models support which mode:
| Mode | Configuration | Supported Models |
|---|---|---|
| Adaptive | `thinking: {"type": "adaptive"}` | Newer models (check docs for latest) |
| Manual | `thinking: {"type": "enabled", "budget_tokens": N}` | All thinking-capable models (deprecated on newer models) |
| Disabled | Omit the `thinking` parameter | All models |
## Summarized Thinking
When extended thinking is enabled on Claude 4 models, the API returns a summary of Claude's full thinking process rather than the raw thinking output. This is an important distinction.
What summarized thinking means for you:
- You get the key ideas and reasoning steps, not every intermediate token
- You are billed for the full thinking tokens, not the summary tokens -- the billed output token count will not match the visible token count in the response
- The first few lines of thinking output are more verbose and detailed, which is especially helpful for prompt engineering
- Summarization adds minimal latency and preserves the intelligence benefits of extended thinking
- The summarization is processed by a different model than the one generating the thinking
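Because of this, the billed output token count will exceed anything you can count in the visible response. The sketch below surfaces that gap: `response.usage.output_tokens` is a real SDK field, but the function name and the 4-characters-per-token heuristic are assumptions for illustration.

```python
from types import SimpleNamespace

# Sketch: compare visible (summarized) thinking length against billed
# output tokens. The 4-chars-per-token heuristic is a rough assumption.
def thinking_visibility_gap(response):
    visible_chars = sum(
        len(block.thinking)
        for block in response.content
        if block.type == "thinking"
    )
    approx_visible_tokens = visible_chars // 4       # crude heuristic
    billed_output_tokens = response.usage.output_tokens  # full thinking + answer
    return approx_visible_tokens, billed_output_tokens

# Works against any object with .content and .usage, e.g. a mock:
mock = SimpleNamespace(
    content=[SimpleNamespace(type="thinking", thinking="x" * 400)],
    usage=SimpleNamespace(output_tokens=9000),
)
```

Expect the billed figure to be much larger than the visible estimate whenever summarization kicks in.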
Example response structure:
```json
{
  "content": [
    {
      "type": "thinking",
      "thinking": "The user is asking about proving infinitely many primes. I'll use Euclid's proof by contradiction...",
      "signature": "WaUjzkypQ2mUEVM36O2TxuC06KN8..."
    },
    {
      "type": "text",
      "text": "Here is the proof that there are infinitely many prime numbers..."
    }
  ]
}
```

The `signature` field is an opaque encrypted value used to verify that the thinking block was generated by Claude. You do not need to parse or interpret it. When passing thinking blocks back to the API (for tool use continuations), include the complete, unmodified block.
**Note:** Claude Sonnet 3.7 returns full (unsummarized) thinking output. All Claude 4 models use summarization.
## When Extended Thinking Helps Most
Extended thinking produces the biggest quality improvements on problems with these characteristics:
**Complex multi-step reasoning.** Problems where the solution requires multiple logical steps that each depend on the previous one: mathematical proofs, algorithm design, step-by-step debugging.

**Ambiguous or underspecified problems.** When there are multiple valid interpretations, thinking lets Claude explore them and choose the most defensible path before committing.

**Tradeoff analysis.** Architecture decisions, performance vs. maintainability choices, security vs. usability tradeoffs -- problems where the answer depends on careful weighing of competing factors.

**Code debugging with subtle causes.** Bugs caused by timing issues, state mutation, race conditions, or incorrect assumptions about library behavior -- problems where jumping to an answer often produces wrong hypotheses.

**Complex refactoring.** Refactoring that spans multiple files with order-of-operations constraints. Thinking helps Claude reason about dependency order, migration paths, and rollback strategies before generating the plan.
## When to Reduce or Skip Thinking
Not every task benefits from extended thinking. For these cases, use low effort or disable thinking entirely:
- Simple factual lookups: "What is the syntax for a Python list comprehension?" -- no benefit
- Boilerplate generation: Code templates, CRUD scaffolding, trivial transformations
- Latency-sensitive applications: Thinking adds perceptible delay (sometimes seconds)
- High-volume batch processing: Thinking tokens increase cost at scale
- Simple classification or routing: Tasks where the answer space is small and well-defined
In Claude Code, you can toggle thinking off with `Alt+T` / `Option+T` for quick tasks, then re-enable it when you switch to something complex.
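In API code, one way to act on these categories is a small routing table from coarse task labels to effort levels. The labels and mapping below are purely illustrative assumptions, not part of the API; only the effort values themselves come from the table earlier in this lesson.

```python
# Illustrative routing table: map a coarse task label to an effort level.
# The labels are assumptions for this sketch; the effort values are the
# documented ones (low / medium / high / max).
EFFORT_BY_TASK = {
    "lookup": "low",          # simple factual or syntax questions
    "boilerplate": "low",     # templates, CRUD scaffolding
    "classification": "low",  # small, well-defined answer spaces
    "debugging": "high",      # subtle bugs, race conditions
    "architecture": "max",    # deep tradeoff analysis
}

def effort_for(task_type: str) -> str:
    # Fall back to a balanced default for anything unrecognized.
    return EFFORT_BY_TASK.get(task_type, "medium")
```

The returned string would go in `output_config={"effort": ...}` on the request.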
## API Usage: Complete Examples

### Basic API Call with Adaptive Thinking
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[{
        "role": "user",
        "content": "Walk through the failure modes of this distributed cache implementation..."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"REASONING: {block.thinking}")
    elif block.type == "text":
        print(f"ANSWER: {block.text}")
```

### Thinking with Tool Use
When using extended thinking with tools, you must pass thinking blocks back to the API when continuing with tool results. This preserves Claude's reasoning continuity.
```python
import anthropic

client = anthropic.Anthropic()

weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

# First request -- Claude thinks and decides to call a tool
response = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    tools=[weather_tool],
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)

# Extract thinking and tool use blocks
thinking_block = next(
    (block for block in response.content if block.type == "thinking"), None
)
tool_use_block = next(
    (block for block in response.content if block.type == "tool_use"), None
)

# Continue with the tool result -- include the thinking block!
continuation = client.messages.create(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    tools=[weather_tool],
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": [thinking_block, tool_use_block]},
        {
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use_block.id,
                "content": "Current temperature: 72°F, sunny"
            }],
        },
    ],
)
```

Important constraints for tool use with thinking:

- Only `tool_choice: {"type": "auto"}` (the default) or `tool_choice: {"type": "none"}` is supported. Forced tool use is incompatible with thinking.
- You cannot toggle thinking on or off in the middle of a tool use loop. The entire assistant turn must operate in a single thinking mode.
- Always pass back complete, unmodified thinking blocks.
### Streaming Considerations

Extended thinking fully supports streaming. When streaming is enabled, you receive thinking content via `thinking_delta` events before the `text_delta` events for the final response.
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-6",  # Check docs.anthropic.com for latest model IDs
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[{
        "role": "user",
        "content": "What is the greatest common divisor of 1071 and 462?"
    }]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            block_type = event.content_block.type
            print(f"\n[Starting {block_type} block]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
```

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = await client.messages.stream({
  model: "claude-opus-4-6", // Check docs.anthropic.com for latest model IDs
  max_tokens: 16000,
  thinking: { type: "adaptive" },
  messages: [{
    role: "user",
    content: "What is the greatest common divisor of 1071 and 462?"
  }]
});

for await (const event of stream) {
  if (event.type === "content_block_start") {
    console.log(`\n[Starting ${event.content_block.type} block]`);
  } else if (event.type === "content_block_delta") {
    if (event.delta.type === "thinking_delta") {
      process.stdout.write(event.delta.thinking);
    } else if (event.delta.type === "text_delta") {
      process.stdout.write(event.delta.text);
    }
  }
}
```

Streaming behavior notes:

- Text may arrive in larger chunks alternating with smaller token-by-token delivery -- this is expected
- The `signature` is added via a `signature_delta` event just before the `content_block_stop` event
- The SDKs require streaming when `max_tokens` is greater than 21,333 to avoid HTTP timeouts
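The last note can be captured in a tiny guard before choosing between `create` and `stream`. The function name and threshold constant are illustrative; only the 21,333 figure comes from the SDK behavior described above.

```python
# Illustrative guard: per the SDK note above, requests with max_tokens
# above this threshold must use streaming to avoid HTTP timeouts.
STREAMING_REQUIRED_ABOVE = 21_333

def must_stream(max_tokens: int) -> bool:
    return max_tokens > STREAMING_REQUIRED_ABOVE
```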
## Cost Implications
Thinking tokens are billed at output token rates (not input rates). This is the most important cost consideration.
If Claude thinks for 8,000 tokens and produces 1,000 tokens of final answer, you are billed for your input tokens plus 9,000 output-rate tokens (8,000 thinking + 1,000 answer). With summarized thinking, the billed count reflects the full thinking, not the shorter summary you see.
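That arithmetic can be sketched as a quick estimator. The per-million-token rates below are placeholders, not real pricing (check anthropic.com/pricing); only the "thinking bills at the output rate" rule comes from the text above.

```python
# Placeholder rates in dollars per million tokens -- NOT real pricing;
# see anthropic.com/pricing for current values.
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00

def estimate_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int) -> float:
    # Thinking tokens are billed at the output rate alongside the answer.
    output_total = thinking_tokens + answer_tokens
    return (input_tokens * INPUT_RATE_PER_MTOK
            + output_total * OUTPUT_RATE_PER_MTOK) / 1_000_000

# The example above: 8,000 thinking + 1,000 answer = 9,000 output-rate tokens.
cost = estimate_cost(input_tokens=500, thinking_tokens=8000, answer_tokens=1000)
```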
Practical cost impact:
- A call with substantial thinking roughly doubles to triples total cost compared to a non-thinking call
- Thinking tokens from the last assistant turn count as input tokens when passed back in subsequent requests
- The effort parameter is the most practical way to manage thinking costs -- lower effort means fewer thinking tokens
Thinking and prompt caching:

- System prompts and tool definitions remain cached when thinking parameters change
- Changing `budget_tokens` or switching between thinking modes invalidates message cache breakpoints
- Adaptive thinking preserves cache breakpoints across consecutive requests using the same mode
**Tip:** Check anthropic.com/pricing for current per-token rates.
## Practical Tips
1. **Start with defaults in Claude Code.** Extended thinking is already enabled. Use it as-is and only adjust when you have a specific reason.
2. **Match effort to task complexity.** Use `low` for quick questions and simple tasks. Use `high` or `max` for architecture decisions, complex debugging, and multi-step reasoning. Use `medium` as a balanced default for agentic workflows.
3. **Use natural language to encourage deeper thinking.** In Claude Code, phrasing like "think carefully about this" or "reason through this step by step" can encourage more thorough reasoning within the current thinking budget.
4. **Do not over-allocate budget tokens.** If using manual mode, start with the minimum (1,024) and increase incrementally. Claude will not think more just because you set a high ceiling -- but you pay for whatever it does use.
5. **Use batch processing for very large thinking budgets.** Thinking budgets above 32K tokens can cause long-running requests that may hit network timeouts. Use batch processing to avoid this.
6. **Monitor your thinking token usage.** The `/cost` command in Claude Code shows token usage statistics. Use it to understand how much of your budget is going to thinking versus response tokens.
7. **Do not toggle thinking mid-turn.** If Claude is in a tool use loop (calling tools and receiving results), the entire assistant turn must use the same thinking mode. Toggle thinking between turns, not during them.
8. **Extended thinking is incompatible with some features.** You cannot use `temperature` or `top_k` modifications, forced tool use, or response pre-filling when thinking is enabled. `top_p` can be set between 0.95 and 1.
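The constraints in tip 8 can be checked up front. The validator below is a sketch, not part of the SDK; the dictionary keys mirror the Messages API parameter names, and treating `tool_choice` types `"tool"` and `"any"` as forced tool use is an assumption of this sketch.

```python
# Sketch of a pre-flight check for parameters that conflict with extended
# thinking (per tip 8). Illustrative helper, not part of the SDK.
def thinking_param_conflicts(params: dict) -> list[str]:
    problems = []
    if "temperature" in params:
        problems.append("temperature cannot be modified when thinking is enabled")
    if "top_k" in params:
        problems.append("top_k cannot be modified when thinking is enabled")
    top_p = params.get("top_p")
    if top_p is not None and not (0.95 <= top_p <= 1):
        problems.append("top_p must be between 0.95 and 1")
    tool_choice = params.get("tool_choice", {})
    if tool_choice.get("type") in ("tool", "any"):  # forced tool use
        problems.append("forced tool use is incompatible with thinking")
    return problems
```

Run it against your request kwargs before adding the `thinking` parameter and surface any returned messages.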
## Key Takeaways
- Extended thinking gives Claude a reasoning scratchpad before it commits to an answer, improving quality on complex tasks
- Adaptive thinking (`type: "adaptive"`) is the recommended mode for newer models -- Claude decides when and how much to think
- The effort parameter (`low`, `medium`, `high`, `max`) controls thinking depth without requiring you to set a token budget
- Summarized thinking means you see a condensed version of Claude's reasoning, but you are billed for the full thinking tokens
- In Claude Code, thinking is on by default -- toggle it with `Alt+T` / `Option+T` or adjust via `/config`
- Thinking tokens are billed at output token rates and can significantly increase cost -- use effort levels to manage this
- You can stream thinking tokens in real time for transparency in your application
- Reserve deep thinking for complex problems; use low effort or disable thinking for simple tasks where speed matters