Lesson 28: Multi-Model Pipelines
Not every token needs to come from your most powerful model. Multi-model pipelines let you combine Claude's model family — Haiku, Sonnet, and Opus — so you get top-tier quality where it matters and blazing-fast, cheap processing everywhere else.
Why Use Multiple Models?
A single model can't optimize for cost, speed, and quality simultaneously. Haiku is roughly 60x cheaper than Opus per token and responds far faster, but it won't match Opus on complex reasoning. The key insight: most tasks in a pipeline don't require your best model.
Consider a customer support system that handles 10,000 messages per day:
- 80% are simple questions (FAQ lookups, order status) — Haiku handles these perfectly
- 15% need nuanced understanding (complaints, policy edge cases) — Sonnet shines here
- 5% are complex escalations (legal, safety, multi-step reasoning) — Opus is worth the cost
Routing 80% of requests to Haiku while keeping only 5% for Opus can reduce costs by 90%+ compared to sending everything to Opus.
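To see where a figure like that comes from, here is a back-of-envelope calculation for the 10,000-message example. The per-million-token prices and average token counts below are illustrative placeholders, not current pricing — check anthropic.com for real numbers:

```python
# Illustrative (input, output) prices in USD per million tokens — placeholders only.
PRICE = {
    "fast": (0.80, 4.00),
    "balanced": (3.00, 15.00),
    "powerful": (15.00, 75.00),
}

def daily_cost(mix: dict, messages=10_000, in_tok=500, out_tok=300) -> float:
    """Cost of a day's traffic given a {tier: fraction-of-traffic} routing mix."""
    total = 0.0
    for tier, fraction in mix.items():
        in_price, out_price = PRICE[tier]
        per_msg = (in_tok * in_price + out_tok * out_price) / 1_000_000
        total += messages * fraction * per_msg
    return total

all_opus = daily_cost({"powerful": 1.0})
tiered = daily_cost({"fast": 0.80, "balanced": 0.15, "powerful": 0.05})
print(f"All-Opus: ${all_opus:.2f}/day, tiered: ${tiered:.2f}/day, "
      f"savings: {1 - tiered / all_opus:.0%}")
```

With these placeholder prices the tiered mix comes out close to 90% cheaper than sending everything to Opus; the exact figure depends on real prices and token counts.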
The Haiku → Sonnet → Opus Pattern
The most common pipeline is a tiered escalation: start cheap, escalate when needed.
```python
import anthropic

client = anthropic.Anthropic()

# Check docs.anthropic.com for the latest model IDs
MODELS = {
    "fast": "claude-haiku-4-20250514",
    "balanced": "claude-sonnet-4-20250514",
    "powerful": "claude-opus-4-20250514",
}

def tiered_response(prompt: str) -> dict:
    """Try the cheapest model first, escalate if confidence is low."""
    # Stage 1: Fast first attempt with Haiku
    result = client.messages.create(
        model=MODELS["fast"],
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    response_text = result.content[0].text

    # Stage 2: Self-assess — ask Haiku if it's confident
    check = client.messages.create(
        model=MODELS["fast"],
        max_tokens=50,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response_text},
            {"role": "user", "content": "Rate your confidence in that answer: HIGH or LOW. Reply with one word."},
        ],
    )
    confidence = check.content[0].text.strip().upper()
    if "HIGH" in confidence:
        return {"model_used": "fast", "response": response_text}

    # Stage 3: Escalate to Sonnet (or Opus for critical tasks)
    escalated = client.messages.create(
        model=MODELS["balanced"],
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"model_used": "balanced", "response": escalated.content[0].text}
```

Router Patterns
Instead of escalating sequentially, a router classifies the task up front and sends it directly to the right model.
```python
def classify_and_route(prompt: str) -> str:
    """Use Haiku to classify, then route to the appropriate model."""
    classification = client.messages.create(
        model=MODELS["fast"],
        max_tokens=20,
        system=(
            "Classify the following user request as SIMPLE, MODERATE, or COMPLEX. "
            "SIMPLE: factual lookups, formatting, short answers. "
            "MODERATE: summarization, analysis, multi-step instructions. "
            "COMPLEX: creative writing, deep reasoning, code architecture, ambiguous problems. "
            "Reply with one word only."
        ),
        messages=[{"role": "user", "content": prompt}],
    )
    difficulty = classification.content[0].text.strip().upper()

    model_map = {
        "SIMPLE": "fast",
        "MODERATE": "balanced",
        "COMPLEX": "powerful",
    }
    chosen = model_map.get(difficulty, "balanced")

    result = client.messages.create(
        model=MODELS[chosen],
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.content[0].text
```

The classification call with Haiku is extremely cheap — fractions of a cent — so the overhead of routing is negligible compared to the savings.
Fallback Chains
Fallback chains handle failures gracefully. If a cheaper model returns poor output, you automatically retry with a more capable one.
```python
def fallback_chain(prompt: str, validators: list) -> str:
    """Try models from cheapest to most capable, validating each response."""
    for tier in ["fast", "balanced", "powerful"]:
        result = client.messages.create(
            model=MODELS[tier],
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        response = result.content[0].text

        # Run all validators — if any fail, escalate to the next tier
        if all(v(response) for v in validators):
            print(f"Accepted response from {tier} model")
            return response
        print(f"{tier} model response failed validation, escalating...")

    return response  # Return best-effort from most powerful model
```

This pattern is especially effective when you have objective validators — JSON schema checks, unit tests, or rule-based scoring — that can programmatically decide whether a response is good enough.
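As a concrete sketch, here are two hypothetical validators for a task that must return a JSON object containing a "summary" field; plain functions like these plug straight into the `validators` argument of `fallback_chain`:

```python
import json

def is_valid_json(response: str) -> bool:
    """Accept only responses that parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def has_summary_key(response: str) -> bool:
    """Accept only JSON objects that contain a 'summary' field."""
    try:
        parsed = json.loads(response)
        return isinstance(parsed, dict) and "summary" in parsed
    except json.JSONDecodeError:
        return False

# Hypothetical usage with the fallback chain above:
# answer = fallback_chain(
#     "Summarize this support ticket as JSON with a 'summary' field: ...",
#     validators=[is_valid_json, has_summary_key],
# )
```

Because both checks are deterministic, escalation decisions are reproducible and cost nothing beyond the failed model call itself.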
When NOT to Use Pipelines
Multi-model pipelines add complexity. Skip them when:
- Your volume is low. If you're making 100 API calls a day, the cost difference between Haiku and Opus is negligible. Don't over-engineer.
- Latency is critical. Each routing step adds a round trip. For real-time chat, a single Sonnet call is often better than Haiku-classify → Sonnet-respond.
- The task is uniformly complex. If every request genuinely needs deep reasoning, routing just wastes tokens on classification.
- You're prototyping. Get it working with one model first. Optimize with pipelines after you've validated the product.
Rule of thumb: If you can't articulate clear categories of task difficulty in your domain, you probably don't need a router yet.
Key Takeaways
- Match model capability to task difficulty — most requests don't need your most powerful model
- The Haiku → Sonnet → Opus escalation pattern catches most tasks at the cheapest tier
- Router patterns classify tasks up front, avoiding sequential latency
- Fallback chains let you try cheap models first and automatically escalate on failure
- Routing 80% of traffic to Haiku can cut costs by 90%+ compared to always using Opus
- Don't add pipelines prematurely — complexity has its own cost