Lesson 28: Multi-Model Pipelines
Not every token needs to come from your most powerful model. Multi-model pipelines let you combine Claude's model family — Haiku, Sonnet, and Opus — so you get top-tier quality where it matters and blazing-fast, cheap processing everywhere else.
Why Use Multiple Models?
A single model can't optimize for cost, speed, and quality simultaneously. Haiku is roughly 60x cheaper than Opus per token and responds far faster, but it won't match Opus on complex reasoning. The key insight: most tasks in a pipeline don't require your best model.
Consider a customer support system that handles 10,000 messages per day:
- 80% are simple questions (FAQ lookups, order status) — Haiku handles these perfectly
- 15% need nuanced understanding (complaints, policy edge cases) — Sonnet shines here
- 5% are complex escalations (legal, safety, multi-step reasoning) — Opus is worth the cost
Routing 80% of requests to Haiku while keeping only 5% for Opus can reduce costs by 90%+ compared to sending everything to Opus.
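To see where a figure like that comes from, here is a back-of-envelope calculation for the 10,000-message example. The per-million-token prices and average token counts below are illustrative placeholders, not current pricing — check anthropic.com for real numbers:

```python
# Illustrative (input, output) prices in USD per million tokens — placeholders only.
PRICE = {
    "fast": (0.80, 4.00),
    "balanced": (3.00, 15.00),
    "powerful": (15.00, 75.00),
}

def daily_cost(mix: dict, messages=10_000, in_tok=500, out_tok=300) -> float:
    """Cost of a day's traffic given a {tier: fraction-of-traffic} routing mix."""
    total = 0.0
    for tier, fraction in mix.items():
        in_price, out_price = PRICE[tier]
        per_msg = (in_tok * in_price + out_tok * out_price) / 1_000_000
        total += messages * fraction * per_msg
    return total

all_opus = daily_cost({"powerful": 1.0})
tiered = daily_cost({"fast": 0.80, "balanced": 0.15, "powerful": 0.05})
print(f"All-Opus: ${all_opus:.2f}/day, tiered: ${tiered:.2f}/day, "
      f"savings: {1 - tiered / all_opus:.0%}")
```

With these placeholder prices the tiered mix comes out close to 90% cheaper than sending everything to Opus; the exact figure depends on real prices and token counts.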
The Haiku → Sonnet → Opus Pattern
The most common pipeline is a tiered escalation: start cheap, escalate when needed.
```python
import anthropic

client = anthropic.Anthropic()

# Check docs.anthropic.com for the latest model IDs
MODELS = {
    "fast": "claude-haiku-4-20250514",
    "balanced": "claude-sonnet-4-20250514",
    "powerful": "claude-opus-4-20250514",
}

def tiered_response(prompt: str) -> dict:
    """Try the cheapest model first, escalate if confidence is low."""
    # Stage 1: Fast first attempt with Haiku
    result = client.messages.create(
        model=MODELS["fast"],
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    response_text = result.content[0].text

    # Stage 2: Self-assess — ask Haiku if it's confident
    check = client.messages.create(
        model=MODELS["fast"],
        max_tokens=50,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response_text},
            {"role": "user", "content": "Rate your confidence in that answer: HIGH or LOW. Reply with one word."},
        ],
    )
    confidence = check.content[0].text.strip().upper()
    if "HIGH" in confidence:
        return {"model_used": "fast", "response": response_text}

    # Stage 3: Escalate to Sonnet (or Opus for critical tasks)
    escalated = client.messages.create(
        model=MODELS["balanced"],
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"model_used": "balanced", "response": escalated.content[0].text}
```

Router Patterns
Instead of escalating sequentially, a router classifies the task up front and sends it directly to the right model.
```python
def classify_and_route(prompt: str) -> str:
    """Use Haiku to classify, then route to the appropriate model."""
    classification = client.messages.create(
        model=MODELS["fast"],
        max_tokens=20,
        system=(
            "Classify the following user request as SIMPLE, MODERATE, or COMPLEX. "
            "SIMPLE: factual lookups, formatting, short answers. "
            "MODERATE: summarization, analysis, multi-step instructions. "
            "COMPLEX: creative writing, deep reasoning, code architecture, ambiguous problems. "
            "Reply with one word only."
        ),
        messages=[{"role": "user", "content": prompt}],
    )
    difficulty = classification.content[0].text.strip().upper()

    model_map = {
        "SIMPLE": "fast",
        "MODERATE": "balanced",
        "COMPLEX": "powerful",
    }
    chosen = model_map.get(difficulty, "balanced")

    result = client.messages.create(
        model=MODELS[chosen],
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.content[0].text
```

The classification call with Haiku is extremely cheap — fractions of a cent — so the overhead of routing is negligible compared to the savings.
Fallback Chains
Fallback chains handle failures gracefully. If a cheaper model returns poor output, you automatically retry with a more capable one.
```python
def fallback_chain(prompt: str, validators: list) -> str:
    """Try models from cheapest to most capable, validating each response."""
    for tier in ["fast", "balanced", "powerful"]:
        result = client.messages.create(
            model=MODELS[tier],
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        response = result.content[0].text

        # Run all validators — if any fail, escalate to the next tier
        if all(v(response) for v in validators):
            print(f"Accepted response from {tier} model")
            return response
        print(f"{tier} model response failed validation, escalating...")

    return response  # Return best-effort from most powerful model
```

This pattern is especially effective when you have objective validators — JSON schema checks, unit tests, or rule-based scoring — that can programmatically decide whether a response is good enough.
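As a concrete sketch, here are two hypothetical validators for a task that must return a JSON object containing a "summary" field; plain functions like these plug straight into the `validators` argument of `fallback_chain`:

```python
import json

def is_valid_json(response: str) -> bool:
    """Accept only responses that parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def has_summary_key(response: str) -> bool:
    """Accept only JSON objects that contain a 'summary' field."""
    try:
        parsed = json.loads(response)
        return isinstance(parsed, dict) and "summary" in parsed
    except json.JSONDecodeError:
        return False

# Hypothetical usage with the fallback chain above:
# answer = fallback_chain(
#     "Summarize this support ticket as JSON with a 'summary' field: ...",
#     validators=[is_valid_json, has_summary_key],
# )
```

Because both checks are deterministic, escalation decisions are reproducible and cost nothing beyond the failed model call itself.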
When NOT to Use Pipelines
Multi-model pipelines add complexity. Skip them when:
- Your volume is low. If you're making 100 API calls a day, the cost difference between Haiku and Opus is negligible. Don't over-engineer.
- Latency is critical. Each routing step adds a round trip. For real-time chat, a single Sonnet call is often better than Haiku-classify → Sonnet-respond.
- The task is uniformly complex. If every request genuinely needs deep reasoning, routing just wastes tokens on classification.
- You're prototyping. Get it working with one model first. Optimize with pipelines after you've validated the product.
Rule of thumb: If you can't articulate clear categories of task difficulty in your domain, you probably don't need a router yet.
Key Takeaways
- Match model capability to task difficulty — most requests don't need your most powerful model
- The Haiku → Sonnet → Opus escalation pattern catches most tasks at the cheapest tier
- Router patterns classify tasks up front, avoiding sequential latency
- Fallback chains let you try cheap models first and automatically escalate on failure
- Routing 80% of traffic to Haiku can cut costs by 90%+ compared to always using Opus
- Don't add pipelines prematurely — complexity has its own cost