ENGINEERING · LLM HUB
PUBLISHED 2026-05-08 · ~10-MIN READ · WARRANT ENGINEERING

two models. one pipeline.

Opus 4.7 for judgement. Sonnet 4.6 for structure. Why mixing models roughly halves the per-trace cost at constant prompt size while staying within 1.5pp of the all-Opus pass rate.

ROUTING
Opus + Sonnet · 4.7 · 4.6
Opus 4.7 for Classify and Assess (judgement). Sonnet 4.6 for Extract and Map (structure).
COST
≈ $0.20 · per trace
Mixed routing at the per-stage budget shown in section 05. Reconciled against published Anthropic pricing.
RATIO
≈ 2× cheaper · vs all-Opus
1.5pp pass-rate cost on the 200-trace regression set. Paired McNemar p-value reported in eval card.
ROUTED · 4 STAGES · 2 MODELS
Classify and Assess on Opus 4.7. Extract and Map on Sonnet 4.6. One evidence PDF, four model calls, one log per trace.
01 · THE ALL-OPUS DEFAULT

The all-Opus default fails the unit-economics test.

Trace shape · ~12K input tokens, ~3K output tokens across four stages · public Anthropic pricing

The default reflex on agentic pipelines is "just use the strongest model everywhere." It is the path of least resistance and it does not survive a spreadsheet.

A representative Warrant trace ingests ~12K input tokens and emits ~3K output tokens across four stages. Opus 4.7 lists at $15 per million input tokens, $75 per million output tokens. Run the math on every stage:

arithmetic · all-opus
# input cost per trace
12,000 input tokens  *  $15 / 1,000,000  =  $0.180

# output cost per trace
3,000 output tokens  *  $75 / 1,000,000  =  $0.225

# total per trace, all stages on opus 4.7
$0.180 + $0.225 = $0.405
≈ $0.40 per trace at the per-stage budget shown in section 05

That is the all-Opus baseline at the per-stage token budget published in the section 05 table. The same budget routed across Opus and Sonnet lands at ~$0.20. Both numbers reconcile to the same input and output token totals; routing changes the per-token rate, not the budget.

At 1,000 traces a day, all-Opus burns ~$400 a day, ~$12,000 a month, on a single workload. Mixed routing cuts the same load to ~$200 a day, ~$6,000 a month. The cheapest-model default fails the other way: judgement scores collapse and the auditor walks. Mixing is the only option that passes both tests; the question is which stage gets which model.
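The volume figures follow directly from the per-trace costs. A minimal sketch, assuming a 30-day month and the rounded $0.40 and $0.20 per-trace figures from this section; `project_spend` is an illustrative helper, not part of the pipeline:

```python
def project_spend(cost_per_trace: float, traces_per_day: int, days_per_month: int = 30) -> dict:
    """Project daily and monthly spend in USD from a per-trace cost."""
    daily = cost_per_trace * traces_per_day
    return {"daily": round(daily, 2), "monthly": round(daily * days_per_month, 2)}

# rounded per-trace costs from this section, at 1,000 traces a day
all_opus = project_spend(0.40, 1_000)
mixed = project_spend(0.20, 1_000)
print("all-opus:", all_opus)
print("mixed:   ", mixed)
```

Changing `traces_per_day` is the quickest way to see where the gap starts to matter for a given workload.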

02 · JUDGEMENT VS STRUCTURE

Judgement vs structure.

Axis definition · per-stage taxonomy · pipeline as four model calls, not one

Stages split cleanly into two camps once you stop treating them as one workload. Judgement versus structure.

JUDGEMENT

Classify and Assess.

"Is this an advisory action or an informational one." "Does this action require a regulator-recognised authorisation that is not present in the trace." Calls that hinge on world knowledge of regulator language, on understanding what counts as a recommendation under the relevant regime.

STRUCTURE

Extract and Map.

"Pull the structured action object out of this trace." "For each action, list the obligation IDs it triggers from the menu." Calls that hinge on producing a deterministic JSON object that fits a schema, not on judgement.

The capability gap between Opus 4.7 and Sonnet 4.6 is asymmetric across that split; the price gap is not. Opus is markedly stronger on judgement and only marginally stronger on structure. Sonnet is effectively level with Opus on schema-conformant JSON and five times cheaper on both input and output tokens ($3 / $15 against $15 / $75 per million). The optimisation is to send each stage to the model whose strengths the stage consumes.

03 · OPUS ON JUDGEMENT

Why Opus on the judgement stages.

Classify · Assess · two real failure modes from the regression set

The "advisory in disguise" case is the cleanest example. A retail brokerage trace contains a passage where the agent says "based on your recent portfolio mix, you might consider rebalancing toward duration-shorter exposure." Sonnet 4.6 classifies the action as informational, on the basis that the verb "consider" softens the recommendation. Opus 4.7 classifies it as advisory, picks up the implicit personal recommendation embedded in "based on your portfolio," and flags it for the Assess stage.

In nine of nine retail-advisory regimes mapped on the platform, that classification difference is the line between a triggered obligation and a clean trace. A pipeline that misses the recommendation produces a clean evidence PDF that an examiner will reject the moment the regulator reads the action as advisory. The Opus call costs more; a missed regulator finding costs more still.

Assess is the same shape. Sonnet 4.6 will accept generic platform terms-of-service as adequate authorisation. Opus 4.7 distinguishes between a platform consent click and a regulator-recognised authorisation under the relevant regime. The world-knowledge gap on regulator language is the load-bearing capability, and the price differential is the price of that gap.

The agreement metric is Cohen's kappa, scored two ways on the 200-trace regression set. Inter-annotator kappa on the labels themselves is 0.84, above the 0.80 line where the conventional Landis and Koch bands move from "substantial" to "almost perfect" agreement, and the reason the gold is usable. Model-vs-human kappa on Classify is 0.86 on Opus 4.7 and 0.79 on Sonnet 4.6 against the same labels; on Assess, 0.83 on Opus and 0.71 on Sonnet. The routing rests on the kappa difference between the two models on the same task, not on either model's distance from the labels. The gap concentrates on traces that contain regulator-language ambiguity, not on the easy cases. All figures are unweighted Cohen's kappa; full label-confusion matrices and 95% bootstrap CIs ship in the eval card on /trust.
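For readers who want to reproduce the metric, here is a minimal sketch of unweighted Cohen's kappa for two label sequences. The toy labels are invented for illustration and are not drawn from the regression set; the function name is ours:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(rater_a)
    # observed agreement: share of items where both raters chose the same label
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# toy labels, invented for illustration only
human = ["advisory", "advisory", "informational", "informational", "advisory", "informational"]
model = ["advisory", "informational", "informational", "informational", "advisory", "informational"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is a better routing signal than plain accuracy when one label dominates the set.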

04 · SONNET ON STRUCTURE

Why Sonnet on the structure stages.

Anthropic tool-use schema · deterministic JSON return · 99.6% schema-conformance on Sonnet 4.6

Extract and Map do not consume world knowledge of regulator language. They consume schema discipline. The Anthropic tool-use API returns the tool input as a JSON object shaped by a declared input schema, and the stage is graded on schema conformance, not on prose.

Sonnet 4.6 hits the schema as reliably as Opus on both stages. The regression set sees 99.6% schema-conformant returns on Sonnet for Extract versus 99.7% on Opus, a gap inside the noise floor. The Map gap is similar. At a fifth of both the input and the output price, Sonnet is the right model for the load.

The tool definition is a Pydantic model exported as JSON Schema for the Anthropic tool-input format. Pydantic is the source of truth; the tool schema is generated from it:

python · extract stage tool definition
from pydantic import BaseModel, Field
from typing import Literal

class Action(BaseModel):
    actor: str = Field(..., description="the agent or system component performing the action")
    action: str = Field(..., description="a verb-phrase describing the operation")
    subject: str = Field(..., description="the natural or legal person acted upon")
    inputs: list[str]
    outputs: list[str]
    timestamp: str = Field(..., description="ISO-8601 UTC")
    kind: Literal["advisory", "informational", "transactional"]

class ExtractResult(BaseModel):
    actions: list[Action]

# the tool registered with the Anthropic SDK
extract_tool = {
    "name": "emit_actions",
    "description": "emit the structured list of actions found in the trace",
    "input_schema": ExtractResult.model_json_schema(),
}

Map uses the same pattern. A list of obligation IDs per action, every ID drawn from the menu in regulations.json, no free-form citations. Sonnet handles both with the discipline the tool-use API enforces, at structure-tier cost.
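A minimal sketch of what a Map tool along those lines might look like, written as a plain JSON Schema dict rather than Pydantic to keep it dependency-free. The tool name `emit_obligations`, the field names, and the three menu IDs are illustrative assumptions; the real menu lives in regulations.json. Declaring the menu as an `enum` tells the model which IDs are legal, and the `unknown_ids` check catches anything off-menu after the fact:

```python
# hypothetical obligation menu; the real IDs are loaded from regulations.json
OBLIGATION_MENU = ["EU_AI_ACT_ART_12", "EU_AI_ACT_ART_13", "SEBI_RAF_DISCLOSURE"]

def build_map_tool(menu: list[str]) -> dict:
    """Build a tool definition whose obligation IDs are restricted to the menu."""
    return {
        "name": "emit_obligations",  # assumed name, not from the original post
        "description": "emit the obligation IDs each action triggers, drawn from the menu",
        "input_schema": {
            "type": "object",
            "properties": {
                "mappings": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "action_index": {"type": "integer", "minimum": 0},
                            "obligation_ids": {
                                "type": "array",
                                "items": {"type": "string", "enum": list(menu)},
                            },
                        },
                        "required": ["action_index", "obligation_ids"],
                    },
                }
            },
            "required": ["mappings"],
        },
    }

def unknown_ids(tool_input: dict, menu: list[str]) -> list[str]:
    """Return obligation IDs in a Map result that are not on the menu."""
    allowed = set(menu)
    return [
        oid
        for mapping in tool_input["mappings"]
        for oid in mapping["obligation_ids"]
        if oid not in allowed
    ]

map_tool = build_map_tool(OBLIGATION_MENU)
# a result carrying one invented sub-clause suffix
suspect = {"mappings": [{"action_index": 0,
                         "obligation_ids": ["EU_AI_ACT_ART_12", "EU_AI_ACT_ART_12_3A"]}]}
print(unknown_ids(suspect, OBLIGATION_MENU))
```

The post-hoc check is cheap insurance: a schema hint alone does not guarantee the model stays on the menu, so an off-menu ID should fail the stage rather than reach the evidence record.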

05 · THE COST TABLE

The cost table at production prompt sizes.

Per-stage tokens measured on production traces · Anthropic published pricing 2026-05-08

Per-stage prompt budgets, model assignment, and the resulting cost per trace. Opus 4.7 at $15 / $75 per million tokens, Sonnet 4.6 at $3 / $15 per million tokens. The arithmetic is the table:

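| Stage | Input tokens | Output tokens | Model | Cost | Rationale |
| --- | --- | --- | --- | --- | --- |
| Classify | 1.2K | 280 | Opus 4.7 | $0.039 | judgement |
| Extract | 4.8K | 1.4K | Sonnet 4.6 | $0.035 | structure |
| Assess | 3.6K | 720 | Opus 4.7 | $0.108 | judgement |
| Map | 2.4K | 600 | Sonnet 4.6 | $0.016 | structure |
| Total | 12.0K | 3.0K | mixed | $0.198 | ~$0.18–0.20 |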

Production traces drift inside a $0.18 to $0.22 band depending on input size and output verbosity. The all-Opus version of the same workload, at the same per-stage budget, sits at ~$0.40 per trace. The ratio is ~2x; the eval delta is 1.5 percentage points. That is the trade.
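The per-stage arithmetic can be reproduced in a few lines. This sketch uses only the token budgets and list prices quoted in this section; the dictionary and function names are ours, chosen for illustration:

```python
# USD per million tokens (input, output), as quoted in this section
PRICES = {"opus": (15.0, 75.0), "sonnet": (3.0, 15.0)}

# per-stage token budget (input, output) from the cost table
BUDGET = {
    "classify": (1_200, 280),
    "extract": (4_800, 1_400),
    "assess": (3_600, 720),
    "map": (2_400, 600),
}

# production routing: judgement on Opus, structure on Sonnet
MIXED = {"classify": "opus", "extract": "sonnet", "assess": "opus", "map": "sonnet"}

def stage_cost(stage: str, model: str) -> float:
    """Cost in USD of one stage call at the budgeted token counts."""
    tokens_in, tokens_out = BUDGET[stage]
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

def trace_cost(routing: dict[str, str]) -> float:
    """Total cost in USD of one trace under a stage-to-model routing."""
    return sum(stage_cost(stage, model) for stage, model in routing.items())

mixed = trace_cost(MIXED)
all_opus = trace_cost({stage: "opus" for stage in BUDGET})
all_sonnet = trace_cost({stage: "sonnet" for stage in BUDGET})
print(f"mixed ${mixed:.4f} | all-opus ${all_opus:.4f} | all-sonnet ${all_sonnet:.4f}")
print(f"all-opus / mixed = {all_opus / mixed:.2f}x")
```

Unrounded, mixed routing comes to $0.1986 per trace; the table's $0.198 is the sum of the per-row figures after each is rounded to three decimals. The same budget prices out at $0.405 on all-Opus and $0.081 on all-Sonnet.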

The router is eight lines, no orchestration graph, no dependency, no abstraction layer:

python · the router
from enum import Enum

class Stage(Enum):
    CLASSIFY = "classify"
    EXTRACT  = "extract"
    ASSESS   = "assess"
    MAP      = "map"

OPUS   = "claude-opus-4-7"
SONNET = "claude-sonnet-4-6"

def model_for(stage: Stage) -> str:
    judgement = {Stage.CLASSIFY, Stage.ASSESS}
    return OPUS if stage in judgement else SONNET

That function is the production routing logic. No declarative DSL, no condition tree, no orchestration graph. The four call sites read model_for(stage) and pass the string to the Anthropic SDK. When the next model ships, the eval re-runs and the constant changes. The router costs roughly nothing to keep; skipping it means paying the all-Opus rate, roughly $146,000 a year at 1,000 traces a day against roughly $73,000 mixed.

06 · THE EVAL

The eval that justified this.

200-trace regression set · per-stage golden outputs · A/B/C of all-Opus, all-Sonnet, mixed

The routing decision is not a hunch. It rests on a regression set of 200 production-shaped traces drawn from the lending, advisory, and KYC sample workloads, labelled by a regulated-services compliance reviewer with golden outputs at every stage.

Three pipelines run against the set in parallel. All-Opus, all-Sonnet, mixed. Per-stage agreement is scored as Cohen's kappa against the human reviewer. Aggregate quality is the per-trace pass rate, where pass means every stage's output matches the golden within a defined tolerance.
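The pass-rate definition can be sketched as follows. It assumes the per-stage comparison against the golden output (including whatever tolerance applies) has already been reduced to a boolean per stage; the names and the toy results are illustrative, not taken from the eval harness:

```python
STAGES = ("classify", "extract", "assess", "map")

def trace_passes(stage_matches: dict[str, bool]) -> bool:
    """A trace passes only if every stage matched its golden output."""
    return all(stage_matches[stage] for stage in STAGES)

def pass_rate(results: list[dict[str, bool]]) -> float:
    """Share of traces on which all four stages matched."""
    if not results:
        raise ValueError("need at least one trace result")
    return sum(trace_passes(r) for r in results) / len(results)

# four toy traces, invented for illustration: one fails on Assess
toy = [
    {"classify": True, "extract": True, "assess": True, "map": True},
    {"classify": True, "extract": True, "assess": False, "map": True},
    {"classify": True, "extract": True, "assess": True, "map": True},
    {"classify": True, "extract": True, "assess": True, "map": True},
]
rate = pass_rate(toy)
print(rate)
```

Because a single stage miss fails the whole trace, the pass rate is stricter than any per-stage metric and is the number that moves when a judgement stage regresses.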

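| Configuration | Cost / trace | Pass rate | Classify κ | Extract conformance | Assess κ | Map conformance |
| --- | --- | --- | --- | --- | --- | --- |
| All-Opus 4.7 | ~$0.40 | 92.5% | 0.86 | 99.7% | 0.83 | 99.4% |
| Mixed (production) | ~$0.20 | 91.0% | 0.84 | 99.6% | 0.84 | 99.2% |
| All-Sonnet 4.6 | ~$0.08 | 78.5% | 0.79 | 99.6% | 0.71 | 99.2% |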

Mixed is 1.5pp behind all-Opus on pass rate, 12.5pp ahead of all-Sonnet, and ~2x cheaper than all-Opus at constant prompt size. All-Sonnet's failure surface is exactly where the judgement gap is real. Mixed keeps the judgement stages on Opus and collects the cost saving on the structure stages.
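Whether a 1.5pp gap on 200 paired traces is distinguishable from noise is what a paired McNemar test answers. A minimal sketch of the exact two-sided version follows. The discordant counts used here are hypothetical, chosen only so that they net out to 3 traces (1.5pp of 200); the real counts and p-value are in the eval card:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: traces configuration A passes and configuration B fails
    c: traces configuration B passes and configuration A fails
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # binomial tail under the null that each discordant pair is a fair coin flip
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# hypothetical discordant counts, NOT results from the eval card
p_value = mcnemar_exact(b=7, c=4)
print(round(p_value, 4))
```

Only the traces where the two pipelines disagree carry information; the traces both pass or both fail drop out of the test entirely.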

07 · ROUTING FAILURE MODES

Routing failure modes we caught early.

Three failure surfaces · each one drove a routing choice in the table

Three failure modes surfaced during the eval. Each mapped to a routing choice. None was theoretical, all three came out of the regression set on a real trace.

Sonnet under-classifies personal recommendations as informational

The "advisory in disguise" case. Sonnet treats hedged recommendation language as informational, missing the personal recommendation that triggers the obligation. Failure rate on the advisory subset of the regression set ran ~9pp higher than Opus. Drove the Classify-on-Opus decision.

Sonnet hallucinates sub-clauses on long obligation menus

When the Map stage receives a regulator menu of more than 30 obligation IDs, Sonnet occasionally returns an ID with a sub-clause suffix that does not exist in the published regulation. On Sonnet the failure is rare at the menu sizes served in production today. On Haiku 4.5 it was common enough on long synthetic menus to disqualify the model entirely. Drove the Sonnet-as-floor decision on Map.

Opus over-cites on Extract

Opus on Extract produces verbose JSON, filling in optional fields even when the trace does not warrant them. The output is correct but bloated, ~30% more output tokens than Sonnet on the same trace. Drove the Extract-on-Sonnet decision, where Sonnet returns the leaner schema-minimal object the downstream stages prefer.

Each failure mode matched a stage to a model. The table in section 05 is the output of these three observations and the eval that confirmed them.

08 · WHEN ANTHROPIC SHIPS 4.8

What happens when Anthropic ships 4.8.

Eight-line router · model_id as constant · re-run the eval, swap the constant

The point of an eight-line router is the cost of the next swap. When Anthropic ships Opus 4.8 or Sonnet 4.7, the work is the eval, not a migration. Re-run the regression set, compare kappa and pass rate, swap the model_id constant if the new model wins. The eval run takes ~8 minutes.
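One way to make "if the new model wins" explicit is a small comparison over the two eval runs. This is a sketch under stated assumptions: the decision rule (kappa must not fall, pass rate must not fall), the `EvalResult` shape, the candidate identifier, and the candidate's numbers are all hypothetical, not the production criteria:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    model_id: str
    kappa: float       # model-vs-human kappa on the stage under test
    pass_rate: float   # per-trace pass rate on the regression set

def should_swap(incumbent: EvalResult, candidate: EvalResult,
                min_kappa_gain: float = 0.0,
                max_pass_rate_drop: float = 0.0) -> bool:
    """Assumed rule: swap only if kappa improves enough and pass rate does not regress."""
    kappa_ok = candidate.kappa - incumbent.kappa >= min_kappa_gain
    pass_ok = incumbent.pass_rate - candidate.pass_rate <= max_pass_rate_drop
    return kappa_ok and pass_ok

incumbent = EvalResult("claude-opus-4-7", kappa=0.86, pass_rate=0.910)
# invented numbers for a model that does not exist yet
candidate = EvalResult("claude-opus-4-8", kappa=0.88, pass_rate=0.920)
weaker = EvalResult("claude-opus-4-8", kappa=0.85, pass_rate=0.920)
print(should_swap(incumbent, candidate), should_swap(incumbent, weaker))
```

Writing the rule down as code keeps the swap decision reviewable alongside the constant it changes.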

There is no graph to refactor, no chain template to rewrite, no abstraction to migrate. The pipeline is four explicit Anthropic API calls, each one parameterised by a string. The router is a function from stage to string.

The same logic runs in the other direction. When the eval shows Sonnet has caught up to Opus on Classify, the constant moves and cost-per-trace drops without any code review beyond the constant change. The pipeline produces the same evidence record, audited the same way.

EU AI Act Article 12(1) requires automatic recording of events over the lifetime of a high-risk AI system. The multi-model pipeline produces one log per trace, not four. The auditor sees one evidence record mapped to a specific EU AI Act obligation, not a routing diagram.

That logging invariant is the property the routing has to preserve, and the property the architecture is built around.
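A minimal sketch of what that invariant could look like in code: four stage calls are appended to one record, and the record is serialised once per trace. The class and field names are illustrative assumptions, not the production evidence schema:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class StageCall:
    stage: str
    model_id: str
    input_tokens: int
    output_tokens: int

@dataclass
class TraceLog:
    """One record per trace, regardless of how many models served it."""
    trace_id: str
    calls: list[StageCall] = field(default_factory=list)

    def record(self, call: StageCall) -> None:
        self.calls.append(call)

    def to_json_line(self) -> str:
        # a single JSON line, so the trace maps to exactly one log entry
        return json.dumps(asdict(self), sort_keys=True)

log = TraceLog(trace_id="trace-0001")  # hypothetical ID
log.record(StageCall("classify", "claude-opus-4-7", 1_200, 280))
log.record(StageCall("extract", "claude-sonnet-4-6", 4_800, 1_400))
log.record(StageCall("assess", "claude-opus-4-7", 3_600, 720))
log.record(StageCall("map", "claude-sonnet-4-6", 2_400, 600))
line = log.to_json_line()
print(line)
```

The routing is visible inside the record as a per-call `model_id`, but the unit an auditor handles stays the trace.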

09 · FAQ

Questions an architect asks first.

FAQ · drawn from inbound from platform and ML teams Apr to May 2026
Why not Haiku 4.5 on the structure stages?

We tested Haiku 4.5 on the Map stage. Eval scores dropped 11pp on obligation citation accuracy, with the failure concentrated on long obligation menus where Haiku invented sub-clauses that did not exist in the published regulation. The cost gain was real, the correctness loss was disqualifying. Sonnet 4.6 is the floor on a stage that produces the citation an auditor will read.

Do you ever fall back if a model is rate-limited?

Yes. A circuit breaker on a 429 response from the Sonnet endpoint swaps the structure stages to Haiku 4.5 for the duration of the trace. Each fallback is logged in the evidence PDF as a fallback record, with the original model_id, the fallback model_id, and the wall-clock at swap. The auditor sees the fallback in the same evidence record, which is independently verifiable without contacting Warrant.
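A minimal sketch of that per-trace breaker. `RateLimited` stands in for whatever exception the client raises on a 429, the Haiku identifier is an assumption that follows the router's naming pattern, and the record fields mirror the three items listed above:

```python
import time

class RateLimited(Exception):
    """Stand-in for the SDK's 429 error; a real client raises its own type."""

SONNET = "claude-sonnet-4-6"
HAIKU = "claude-haiku-4-5"  # assumed identifier

class TraceBreaker:
    """Per-trace circuit breaker: once tripped, structure stages stay on the fallback."""

    def __init__(self, primary: str = SONNET, fallback: str = HAIKU):
        self.primary = primary
        self.fallback = fallback
        self.tripped = False
        self.fallback_records: list[dict] = []

    def call(self, stage: str, invoke):
        """invoke(model_id) performs the model call and may raise RateLimited."""
        if not self.tripped:
            try:
                return invoke(self.primary)
            except RateLimited:
                self.tripped = True
                self.fallback_records.append({
                    "stage": stage,
                    "original_model_id": self.primary,
                    "fallback_model_id": self.fallback,
                    "swapped_at": time.time(),  # wall-clock at swap
                })
        return invoke(self.fallback)

# simulate a Sonnet endpoint that is rate-limited for the whole trace
attempts: list[str] = []

def fake_invoke(model_id: str) -> str:
    attempts.append(model_id)
    if model_id == SONNET:
        raise RateLimited()
    return f"{model_id} ok"

breaker = TraceBreaker()
first = breaker.call("extract", fake_invoke)
second = breaker.call("map", fake_invoke)
print(first, second, len(breaker.fallback_records))
```

Scoping the breaker to one trace means the next trace starts on Sonnet again, and every swap leaves a record that can be copied into the evidence PDF.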

What about open-source models like Llama or Qwen?

We tested Qwen-2.5-72B on the Extract stage. Structured-output reliability landed at 88% versus 99.6% for Sonnet 4.6 on the same regression set. The remaining 12% returned malformed JSON or hallucinated optional fields. We are re-evaluating on Qwen-3 when the weights and tool-use tooling stabilise. The router is model-agnostic, the model_id is a string constant.

How do you measure judgement objectively?

Agreement with a human compliance reviewer on a 200-trace regression set. Per-stage golden outputs labelled by a regulated-services compliance lead. Two kappa numbers, scored separately. Inter-annotator kappa on the labels is 0.84, the floor that validates the gold. Model-vs-human kappa on Classify is 0.86 on Opus 4.7 and 0.79 on Sonnet 4.6 against the same labels. The kappa difference between the two models on the same task is what the routing decision rests on, not the model's distance from the labels. These are unweighted Cohen's kappa; full label-confusion matrices and 95% bootstrap CIs ship in the eval card on /trust.

Does this work with Azure or AWS Bedrock-hosted Claude?

Yes. The router is a function from stage to a string model identifier. The hosting layer is a runtime constant, swapped at deploy time. Production today runs on a mix of Anthropic and Azure Foundry, with the same model selection per stage on each. Bedrock works the same way, the model_id changes, the routing logic does not.

What is the carbon argument?

Sonnet 4.6 is five times cheaper than Opus 4.7 on the public price sheet ($3 / $15 against $15 / $75 per million tokens), and we treat that price ratio as a rough proxy for underlying inference compute. Routing structure stages to Sonnet cuts inference compute on those stages by ~30% per trace at the pipeline level. At a thousand traces a day the difference is measurable; at a million it compounds into a real load reduction.

10 · A WORKED CLASSIFY PROMPT

What a Classify prompt actually looks like.

System + user + two few-shot exemplars · Opus 4.7 · temperature 0.2

The Classify prompt is a system block, a user block carrying the trace, and two few-shot exemplars that demonstrate the advisory-versus-informational distinction the stage draws. Short on purpose:

python · classify prompt construction
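import anthropic

# TRACE_ADVISORY_DISGUISED, TRACE_INFORMATIONAL_CLEAN, classify_tool and
# trace_text are defined elsewhere in the pipeline
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system = """You categorise a single AI agent trace against the
9 regulatory regimes in obligations.json across the 6 jurisdictions
the platform covers today. You return a single tool call to the
classify_trace tool. You do not return prose."""

# the Messages API expects every tool_use block to carry an id and to be
# answered by a matching tool_result at the start of the next user turn, so
# each exemplar is closed before the next trace is presented; the ids and the
# "recorded" result text are placeholders
def next_turn(tool_use_id, trace):
    return {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": tool_use_id, "content": "recorded"},
        {"type": "text", "text": trace},
    ]}

few_shot = [
    {"role": "user", "content": TRACE_ADVISORY_DISGUISED},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_fewshot_01", "name": "classify_trace",
         "input": {"kind": "advisory", "jurisdictions": ["EU", "IN"],
                   "regimes": ["EU_AI_ACT", "SEBI_RAF"], "risk_tier": "high"}}
    ]},
    next_turn("toolu_fewshot_01", TRACE_INFORMATIONAL_CLEAN),
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_fewshot_02", "name": "classify_trace",
         "input": {"kind": "informational", "jurisdictions": ["EU"],
                   "regimes": ["EU_AI_ACT"], "risk_tier": "low"}}
    ]},
]

response = client.messages.create(
    model=model_for(Stage.CLASSIFY),
    max_tokens=1024,  # required by the Messages API; sized for one small tool call
    system=system,
    messages=few_shot + [next_turn("toolu_fewshot_02", trace_text)],
    tools=[classify_tool],
    tool_choice={"type": "tool", "name": "classify_trace"},
    temperature=0.2,
)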

Two exemplars, not five, on purpose. The first demonstrates the advisory-disguised case. The second is the genuinely informational counter-case. tool_choice forces the model into the tool, and the low temperature (0.2) suppresses drift on judgement-class outputs. The Anthropic tool-use docs cover forced tool choice in more detail.
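Reading the classification back out is a matter of finding the tool_use block in the response content. A minimal sketch, assuming the SDK's response shape (a `content` list of blocks with `type`, `name`, and `input` attributes); `SimpleNamespace` stands in for a real response here so the example runs without a network call:

```python
from types import SimpleNamespace

def tool_input(response, tool_name: str) -> dict:
    """Return the input dict of the first tool_use block for the named tool."""
    for block in response.content:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input
    raise ValueError(f"no {tool_name} tool call in response")

# stand-in for an SDK response object, built by hand for illustration
fake_response = SimpleNamespace(content=[
    SimpleNamespace(
        type="tool_use",
        name="classify_trace",
        input={"kind": "advisory", "jurisdictions": ["EU"],
               "regimes": ["EU_AI_ACT"], "risk_tier": "high"},
    )
])

classification = tool_input(fake_response, "classify_trace")
print(classification["kind"], classification["risk_tier"])
```

Raising when the block is missing, rather than returning an empty dict, keeps a malformed response from flowing silently into the Assess stage.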

11 · READ THE SOURCE

Read the source directly.