
Lesson 28: Multi-Model Pipelines

Not every token needs to come from your most powerful model. Multi-model pipelines let you combine Claude's model family — Haiku, Sonnet, and Opus — so you get top-tier quality where it matters and blazing-fast, cheap processing everywhere else.


Why Use Multiple Models?

A single model can't optimize for cost, speed, and quality simultaneously. Haiku is roughly 60x cheaper than Opus per token and significantly faster, but it won't match Opus on complex reasoning. The insight: most tasks in a pipeline don't require your best model.

Consider a customer support system that handles 10,000 messages per day:

  • 80% are simple questions (FAQ lookups, order status) — Haiku handles these perfectly
  • 15% need nuanced understanding (complaints, policy edge cases) — Sonnet shines here
  • 5% are complex escalations (legal, safety, multi-step reasoning) — Opus is worth the cost

Routing 80% of requests to Haiku while keeping only 5% for Opus can reduce costs by 90%+ compared to sending everything to Opus.
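The arithmetic behind that estimate is easy to check. This sketch uses hypothetical per-request costs purely for illustration (real pricing is per token and varies by model version):

```python
# Hypothetical average cost per request, in dollars (not real pricing).
COST = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.06}

requests_per_day = 10_000
mix = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}

# Blended daily cost with routing vs. sending everything to Opus
routed = sum(requests_per_day * share * COST[model] for model, share in mix.items())
all_opus = requests_per_day * COST["opus"]

savings = 1 - routed / all_opus
print(f"Routed: ${routed:.2f}/day vs all-Opus: ${all_opus:.2f}/day ({savings:.0%} cheaper)")
```

With these illustrative numbers the routed pipeline costs $53/day against $600/day for all-Opus, a savings of just over 91% — consistent with the 90%+ figure above.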


The Haiku → Sonnet → Opus Pattern

The most common pipeline is a tiered escalation: start cheap, escalate when needed.

Python
import anthropic

client = anthropic.Anthropic()

# Check docs.anthropic.com for latest model IDs
MODELS = {
    "fast": "claude-3-5-haiku-20241022",
    "balanced": "claude-sonnet-4-20250514",
    "powerful": "claude-opus-4-20250514",
}

def tiered_response(prompt: str) -> dict:
    """Try the cheapest model first, escalate if confidence is low."""

    # Stage 1: Fast classification with Haiku
    result = client.messages.create(
        model=MODELS["fast"],
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    response_text = result.content[0].text

    # Stage 2: Self-assess — ask Haiku if it's confident
    check = client.messages.create(
        model=MODELS["fast"],
        max_tokens=50,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response_text},
            {"role": "user", "content": "Rate your confidence in that answer: HIGH or LOW. Reply with one word."},
        ],
    )

    confidence = check.content[0].text.strip().upper()

    if "HIGH" in confidence:
        return {"model_used": "fast", "response": response_text}

    # Stage 3: Escalate to Sonnet (or Opus for critical tasks)
    escalated = client.messages.create(
        model=MODELS["balanced"],
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"model_used": "balanced", "response": escalated.content[0].text}

Router Patterns

Instead of escalating sequentially, a router classifies the task up front and sends it directly to the right model.

Python
def classify_and_route(prompt: str) -> str:
    """Use Haiku to classify, then route to the appropriate model."""

    classification = client.messages.create(
        model=MODELS["fast"],
        max_tokens=20,
        system="Classify the following user request as SIMPLE, MODERATE, or COMPLEX. "
               "SIMPLE: factual lookups, formatting, short answers. "
               "MODERATE: summarization, analysis, multi-step instructions. "
               "COMPLEX: creative writing, deep reasoning, code architecture, ambiguous problems. "
               "Reply with one word only.",
        messages=[{"role": "user", "content": prompt}],
    )

    difficulty = classification.content[0].text.strip().upper()

    model_map = {
        "SIMPLE": "fast",
        "MODERATE": "balanced",
        "COMPLEX": "powerful",
    }
    chosen = model_map.get(difficulty, "balanced")

    result = client.messages.create(
        model=MODELS[chosen],
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )

    return result.content[0].text

The classification call with Haiku is extremely cheap — fractions of a cent — so the overhead of routing is negligible compared to the savings.


Fallback Chains

Fallback chains handle failures gracefully. If a cheaper model returns poor output, you automatically retry with a more capable one.

Python
from collections.abc import Callable

def fallback_chain(prompt: str, validators: list[Callable[[str], bool]]) -> str:
    """Try models from cheapest to most capable, validating each response."""

    for tier in ["fast", "balanced", "powerful"]:
        result = client.messages.create(
            model=MODELS[tier],
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        response = result.content[0].text

        # Run all validators — if any fail, escalate
        if all(v(response) for v in validators):
            print(f"Accepted response from {tier} model")
            return response
        print(f"{tier} model response failed validation, escalating...")

    return response  # Return best-effort from most powerful model

This pattern is especially effective when you have objective validators — JSON schema checks, unit tests, or rule-based scoring — that can programmatically decide whether a response is good enough.
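As a sketch of what such validators might look like for a task that must return a JSON object (the required keys here are hypothetical, chosen for illustration):

```python
import json

def is_valid_json(response: str) -> bool:
    """Accept only responses that parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def has_required_keys(response: str) -> bool:
    """Accept only JSON objects containing the fields this task needs."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"summary", "sentiment"} <= data.keys()

validators = [is_valid_json, has_required_keys]
# result = fallback_chain(prompt, validators)
```

Because each validator is a plain `str -> bool` function, the chain escalates only when a response objectively fails — no extra model call is needed to judge quality.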


When NOT to Use Pipelines

Multi-model pipelines add complexity. Skip them when:

  • Your volume is low. If you're making 100 API calls a day, the cost difference between Haiku and Opus is negligible. Don't over-engineer.
  • Latency is critical. Each routing step adds a round trip. For real-time chat, a single Sonnet call is often better than Haiku-classify → Sonnet-respond.
  • The task is uniformly complex. If every request genuinely needs deep reasoning, routing just wastes tokens on classification.
  • You're prototyping. Get it working with one model first. Optimize with pipelines after you've validated the product.

Rule of thumb: If you can't articulate clear categories of task difficulty in your domain, you probably don't need a router yet.


Key Takeaways

  • Match model capability to task difficulty — most requests don't need your most powerful model
  • The Haiku → Sonnet → Opus escalation pattern catches most tasks at the cheapest tier
  • Router patterns classify tasks up front, avoiding sequential latency
  • Fallback chains let you try cheap models first and automatically escalate on failure
  • Routing 80% of traffic to Haiku can cut costs by 90%+ compared to always using Opus
  • Don't add pipelines prematurely — complexity has its own cost