The compliance-for-AI category is full of products built on agent orchestration libraries. LangGraph for the graph. AutoGen for the multi-agent chat. CrewAI for the role-based crew. Semantic Kernel for the planner. Each library promises composition. Each library, in our experience, delivers coupling. Three things you did not ask for leak in alongside the part you wanted: hidden control flow, hidden retry semantics, and hidden token spend.
Warrant's pipeline is four named Python functions calling the Anthropic API. Each function takes a typed input, returns a typed output, and writes one structured log line per call. A junior engineer reads the orchestrator in twelve lines and understands the system. The auditor's question, which exact stage produced this citation, has a one-line answer in the trace.
This post is the architectural argument for that choice. Not a takedown of any specific library. A defence of explicit functions and typed edges, in the bounded-depth setting where evidence is the deliverable.
The library as accidental complexity.
Take a three-stage pipeline. Classify a trace, extract actions from it, assess each action against a policy. The same task, written two ways. The graph version is roughly fifty lines. The plain version is roughly twenty-five. Both produce identical output on a fixed trace.
```python
from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, List


class State(TypedDict):
    trace: dict
    classification: dict
    actions: List[dict]
    findings: List[dict]


llm = ChatAnthropic(model="claude-opus-4-7")
fast_llm = ChatAnthropic(model="claude-sonnet-4-6")


def classify_node(state):
    msg = llm.invoke([{"role": "user", "content": f"classify: {state['trace']}"}])
    return {"classification": msg.content}


def extract_node(state):
    msg = fast_llm.invoke([{"role": "user", "content": f"extract actions from: {state['trace']}"}])
    return {"actions": msg.content}


def assess_node(state):
    msg = llm.invoke([{"role": "user", "content": f"assess: {state['actions']}"}])
    return {"findings": msg.content}


graph = StateGraph(State)
graph.add_node("classify", classify_node)
graph.add_node("extract", extract_node)
graph.add_node("assess", assess_node)
graph.set_entry_point("classify")
graph.add_edge("classify", "extract")
graph.add_edge("extract", "assess")
graph.add_edge("assess", END)

app = graph.compile()
result = app.invoke({"trace": trace})

# where did the retry happen? what was the wall-clock per node?
# the executor knows. you have to read its source to know.
```
The graph compiles. The output is correct. When extract returns malformed JSON, the executor retries with some backoff schedule and surfaces the result. When assess hits a rate limit, the same executor decides what to do. When you ask "how many tokens did this trace cost?", the answer lives behind a callback handler you have to register.
```python
from anthropic import Anthropic

# max_retries=0 switches off the SDK's built-in retries (the default is 2),
# so that any retry that happens is one you wrote.
client = Anthropic(max_retries=0)


def classify(trace: dict) -> dict:
    r = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"classify: {trace}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}


def extract(trace: dict) -> dict:
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"extract actions: {trace}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}


def assess(actions: dict) -> dict:
    r = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"assess: {actions}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}


classification = classify(trace)
actions = extract(trace)
findings = assess(actions)

# every retry is yours. every token count is in the return value.
# the call site is the control flow.
```
Same output. Half the lines. The retry semantics are explicit: with the SDK's built-in retries switched off (max_retries=0 on the client), there are none yet. When you decide to add them, you wrap one function call in a four-line retry loop with the policy you actually want. Token spend is in the return value, not in a callback. The control flow is the call site. The auditor's question, "which model produced which output", has a one-line answer.
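A minimal sketch of such a retry wrapper, under stated assumptions: the helper name `call_with_retry`, the attempt count, and the backoff values are illustrative, not Warrant's actual policy. In real use the `except` clause would be narrowed to the transient error types you actually want to retry.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call fn, retrying on any exception with exponential backoff.

    Hypothetical helper: attempts and base_delay are placeholder values.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** attempt)
    # Only reached when attempts < 1, because the loop body never ran.
    raise ValueError("attempts must be at least 1")
```

At the call site this reads as `findings = call_with_retry(lambda: assess(actions))`. The policy sits next to the call it governs, so changing it for one stage does not change it for the others.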
This is what accidental complexity means. The graph executor solves the general problem of arbitrary graph traversal, and the price of that generality is paid in every node, every run.
The four functions that ship the artefact.
The Warrant analysis path is exactly four functions. Each takes a typed input, returns a typed output, writes one structured log line per call. The orchestrator is twelve lines.
The signatures are written down in one Pydantic file. Static type checkers, mypy and pyright among them, read it. Misuse is caught at edit time, not at the regulator's desk.
```python
from pydantic import BaseModel, Field
from typing import Literal, List
from datetime import datetime


class ClassificationResult(BaseModel):
    domain: Literal["lending", "advisory", "kyc", "trading"]
    jurisdictions: List[Literal["EU", "IN", "US", "UK", "SG", "AU"]]
    regimes: List[str]
    risk_tier: Literal["low", "limited", "high", "unacceptable"]
    confidence: float = Field(ge=0.0, le=1.0)


class Action(BaseModel):
    actor: str
    action: str
    subject: str
    inputs: dict
    outputs: dict
    ts: datetime


class AuthorizationFinding(BaseModel):
    action_index: int
    within_purpose: bool
    preconditions_met: bool
    human_oversight_appropriate: bool
    reversible: bool
    justification: str


class EvidencePackage(BaseModel):
    classification: ClassificationResult
    actions: List[Action]
    findings: List[AuthorizationFinding]
    obligations: List["ObligationMapping"]
    package_id: str
    ts: datetime
```
Pure-ish is the right word for these functions. The Anthropic API call is a side effect, and so is the structured log line. Both are explicit and visible. No hidden state machine, no callback handler buried in a base class. The function does what it says.
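To make "one structured log line per call" concrete, here is a minimal sketch. The `log` helper, its field names, and the stand-in stage `classify_trace_stub` are assumptions for illustration. The real stages call the model API where this stub returns a fixed dictionary.

```python
import json
import sys
import time
from datetime import datetime, timezone
from typing import Any, TextIO


def log(event: str, stream: TextIO = sys.stdout, **fields: Any) -> None:
    """Write one JSON object on one line: event name, UTC timestamp, then the fields."""
    record = {"event": event, "ts": datetime.now(timezone.utc).isoformat(), **fields}
    stream.write(json.dumps(record, default=str) + "\n")


def classify_trace_stub(trace: dict, stream: TextIO = sys.stdout) -> dict:
    """Stand-in for a stage: the model call is replaced by a fixed result."""
    start = time.perf_counter()
    result = {"domain": "lending", "risk_tier": "high"}  # placeholder, not model output
    # The side effect is written out at the call site, not hidden in a base class.
    log(
        "stage.complete",
        stream=stream,
        stage="classify_trace",
        model="claude-opus-4-7",
        wall_ms=round((time.perf_counter() - start) * 1000, 3),
    )
    return result
```

Because the log call is an ordinary statement in the function body, a reader can see exactly which fields are recorded and when, without tracing a handler chain.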
```python
def run(trace: dict) -> EvidencePackage:
    classification = classify_trace(trace)
    actions = extract_actions(trace, classification)
    findings = assess_authorization(actions)
    package = map_obligations(actions, findings)
    pdf_bytes = render_pdf(package)
    record = finalize_record(pdf_bytes, package.package_id)
    log("pipeline.complete",
        package_id=package.package_id,
        wall_ms=elapsed_ms(),
        cost_usd=total_cost(classification, actions, findings, package))
    return record
```
Twelve lines. Nothing hidden. The auditor's question, "show me the exact code path that produced this citation", resolves to one file, four function names, one ordered list. The European framing in Article 12(1) of the AI Act, on automatic recording of events over the lifetime of the system, leaves no room for an answer like "well, the executor decided". Every stage logs. Every stage names itself.
The word lifetime is the part that dictates the architecture. A system whose internal control flow is opaque cannot guarantee that every event is recorded. A system whose control flow is the call site can.
Why two models, not one.
The pipeline routes by stage character. Classify and Assess require judgement. Distinguishing an informational action from an advisory one inside an advisory context is the kind of distinction Opus 4.7 makes consistently and Sonnet 4.6 sometimes does not. Mapping a specific action to a specific sub-clause out of nine frameworks across six jurisdictions is the same shape of problem, dense legal text, multi-hop inference, narrow correctness. Opus carries those.
Extract and Map are structured-output stages. Tool-use schema in, deterministic JSON out. Sonnet 4.6 is the right model for that work: it is faster and cheaper, and what is on test is structured-output reliability, not legal reasoning.
| Stage | Model | Why | Per-trace cost |
|---|---|---|---|
| 01 · classify_trace | Opus 4.7 | Judgement, multi-jurisdiction routing | ~$0.18 |
| 02 · extract_actions | Sonnet 4.6 | Structured tool-use schema, JSON | ~$0.04 |
| 03 · assess_authorization | Opus 4.7 | Per-action authorisation reasoning | ~$0.32 |
| 04 · map_obligations | Sonnet 4.6 | Sub-clause matching with retrieval | ~$0.06 |
The split lands at roughly sixty cents an analysis on the average production trace. An all-Opus pipeline runs near a dollar fifty for the same output, with marginal accuracy gain on the structured stages. An all-Sonnet pipeline runs under twenty cents and loses meaningfully on the judgement stages. Two models, two roles, the right tool in each hand.
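The sixty-cent figure is the sum of the four per-stage costs in the table. A short worked check, using only the approximate table values (the dictionary name `STAGE_COST_USD` is an illustrative assumption):

```python
# Per-trace cost by stage in USD, copied from the table above (approximate figures).
STAGE_COST_USD = {
    "classify_trace": 0.18,
    "extract_actions": 0.04,
    "assess_authorization": 0.32,
    "map_obligations": 0.06,
}

total = round(sum(STAGE_COST_USD.values()), 2)
judgement = round(
    STAGE_COST_USD["classify_trace"] + STAGE_COST_USD["assess_authorization"], 2
)
structured = round(total - judgement, 2)

print(f"total per trace:   ${total:.2f}")       # 0.18 + 0.04 + 0.32 + 0.06
print(f"judgement stages:  ${judgement:.2f}")   # classify + assess
print(f"structured stages: ${structured:.2f}")  # extract + map
```

On these figures the two judgement stages account for fifty of the sixty cents, which is why the structured stages are the place where a cheaper model changes the bill least and the judgement stages are the place where model quality matters most.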
The routing is not a feature flag. It is a constant per stage. MODEL_JUDGEMENT = "claude-opus-4-7", MODEL_STRUCTURED = "claude-sonnet-4-6". When Anthropic ships a new Sonnet, one string changes. The eval suite reruns. The decision is data, not engineering.
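A sketch of what "a constant per stage" can look like in code. The two constants and their values come from the text above. The `STAGE_MODEL` lookup and the `model_for` helper are hypothetical names added for illustration.

```python
# Model ids as given in the post. Changing a model is a one-string edit here.
MODEL_JUDGEMENT = "claude-opus-4-7"
MODEL_STRUCTURED = "claude-sonnet-4-6"

# Hypothetical lookup mirroring the routing table: one fixed model per stage.
STAGE_MODEL = {
    "classify_trace": MODEL_JUDGEMENT,
    "extract_actions": MODEL_STRUCTURED,
    "assess_authorization": MODEL_JUDGEMENT,
    "map_obligations": MODEL_STRUCTURED,
}


def model_for(stage: str) -> str:
    """Return the model id for a stage. An unknown stage raises KeyError
    instead of falling back to a silent default."""
    return STAGE_MODEL[stage]
```

There is deliberately no run-time override in this sketch: the mapping is fixed at import time, so the model recorded in a log line is always the one the source file names.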
Libraries promise composition. They deliver coupling.
Every orchestration library promises composition. Drop in a new node, swap a model, change a tool, the graph stays intact. The reality of running these libraries through model upgrades, in production, is that the surface they expose is wider than the surface you wanted, and the wider surface couples you to their release cadence.
When Anthropic ships Sonnet 4.6, the upgrade in the four-function pipeline is one constant. MODEL_STRUCTURED = "claude-sonnet-4-6". Run the eval suite. Diff per-stage. Decide. Total elapsed time: the eight minutes the eval takes, plus the human read of the diff.
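The "diff per-stage" step can be as small as the sketch below. The shape of an eval result (a higher-is-better score per stage name), the `diff_per_stage` helper, the tolerance, and the example scores are all assumptions for illustration, not output from Warrant's eval suite.

```python
from typing import Dict


def diff_per_stage(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> Dict[str, dict]:
    """Compare two eval runs stage by stage.

    Scores are assumed to be higher-is-better floats keyed by stage name.
    A stage counts as regressed when its score drops by more than `tolerance`.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("eval runs cover different stages")
    report = {}
    for stage, before in baseline.items():
        delta = round(candidate[stage] - before, 4)
        report[stage] = {
            "before": before,
            "after": candidate[stage],
            "delta": delta,
            "regressed": delta < -tolerance,
        }
    return report


# Made-up scores, only to show the shape of the report.
baseline = {"extract_actions": 0.97, "map_obligations": 0.94}
candidate = {"extract_actions": 0.98, "map_obligations": 0.91}
report = diff_per_stage(baseline, candidate)
regressions = [stage for stage, row in report.items() if row["regressed"]]
```

Because each stage runs on a named model, a regression in the report points at one constant and one function, which is what keeps the upgrade a single-axis decision.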
When the library upgrades, you read its CHANGELOG. Release notes for the executor, schema migration for the state object, deprecation list for tool decorators, breaking changes for callbacks. Maybe you rewrite three nodes because the state interface changed. Maybe a retry default flipped, your bill silently doubles for a week, and you find out from the cost dashboard. The model upgrade and the library upgrade are independent axes of breakage. You signed up for both.
This is what Martin Fowler means by boundaries. Architecture is a series of decisions about which boundaries to draw, where, and what passes across them. The boundary between the analysis pipeline and the model API has one decision in it, the model id. The boundary between the analysis pipeline and an orchestration library has hundreds of decisions in it, embedded in the library's API surface, most of which were not made by you.
Couple to the model API directly. Couple to your own type definitions.
Two surfaces. Both stable. Vendor controls one, you control the other. Upgrading either is a single-axis decision.
Couple to a graph executor that wraps the model API.
Three surfaces. One yours, two vendor. The vendor library tracks model changes on its own schedule, and you absorb every drift event downstream of both.
The compiler is your control-flow graph.
An orchestration library shows you a graph in a notebook. Nodes connected by arrows, the visualisation as the documentation. The picture is appealing and the run-time semantics are not always what it says.
Pydantic models at every boundary make the type checker the graph. classify_trace takes dict and returns ClassificationResult. extract_actions takes dict and ClassificationResult, returns List[Action]. The shape of the pipeline is the signatures, checked on every save and every push. Wire extract_actions into a position that expects List[AuthorizationFinding], mypy catches it before the function runs.
This property does not show up on a feature checklist. It shows up the first time a junior engineer changes a stage signature and the editor lights up red across nine call sites. The pipeline tells you, at edit time, what it expects.
```python
def map_obligations(
    actions: List[Action],
    findings: List[AuthorizationFinding],
) -> EvidencePackage:
    # actions and findings have stable shapes. the function body
    # can index, iterate, and select fields without runtime guards.
    # pyright catches a wrong shape at the call site, not at midnight.
    ...


# wrong wiring, caught by pyright before the function runs:
package = map_obligations(findings, actions)
# error: argument of type "list[AuthorizationFinding]" is not
#        assignable to parameter "actions" of type "list[Action]"
```
The deeper claim is that the compiler is the right place to put a control-flow graph. A control-flow graph that lives in a library's runtime has to be re-checked every run. A control-flow graph that lives in the type system is checked once, in the editor, and stays checked.
Where libraries earn their complexity.
The argument is not that orchestration libraries are wrong. They are right for a different problem.
Three properties make a library's overhead pay off. Unbounded planning depth, where the agent decides at run time how many steps to take and which tools to call. Large tool ecosystems, where the integration cost of fifty tools dwarfs the cost of the orchestration layer. Many human-in-loop interventions, where pausing the graph, presenting state to a human, and resuming on input is the dominant runtime concern.
- Open-ended research agents. Unbounded planning depth, tool selection at run time, dynamic graph shape. The graph executor earns its complexity.
- Tool-rich operator agents. Fifty-plus tools; the integration manifest is the work, and the orchestration overhead is small in comparison.
- Human-in-loop workflows. The pause-and-resume pattern, with state inspection and manual approval, is the right thing to factor out.

Warrant sits at the opposite end on each count.

- Bounded depth. Warrant is always four stages. The depth is in the prompt and the type, not in the run-time graph.
- Bounded tools. Anthropic API for the model. Postgres for retrieval. Two surfaces, neither of which benefits from a tool registry.
- Zero human interventions per trace. A record mapped to a specific EU AI Act obligation is the deliverable. There is no human approval gate inside the run.
The shape of the problem decides the shape of the architecture. An open-ended research agent and a four-stage evidence pipeline are not the same product, and they should not share the same scaffolding. The library earns its keep where the run-time graph is the thing being modelled. It is overhead where the run-time graph is fixed, the tools are two, and the deliverable is a single record that is independently verifiable without contacting Warrant.
The regulator does not care about your DAG.
An auditor reading a Warrant evidence package never sees the four-function pipeline. They see a record mapped to a specific EU AI Act obligation, with a sub-clause citation, independently verifiable without contacting Warrant. The architecture is invisible. That is the point.
The reason the architecture has to be small, named, and explicit is that the artefact has to be defensible. Show me the citation, show me the action that triggered it, show me the model that produced the assessment, show me the timestamp. Each question resolves to one row in one log table. The architecture choice is the precondition for that resolvability.
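As a sketch of "one row in one log table", the example below uses an in-memory SQLite database so it runs anywhere. The table name `stage_log`, its columns, the sample rows, and the `who_produced` helper are hypothetical. The post does not specify the real schema or storage.

```python
import sqlite3

# Hypothetical schema: exactly one row per stage call per package.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stage_log (
        package_id TEXT NOT NULL,
        stage      TEXT NOT NULL,
        model      TEXT NOT NULL,
        ts         TEXT NOT NULL,
        PRIMARY KEY (package_id, stage)
    )
    """
)

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO stage_log VALUES (?, ?, ?, ?)",
    [
        ("pkg-001", "classify_trace", "claude-opus-4-7", "2025-01-01T00:00:00Z"),
        ("pkg-001", "extract_actions", "claude-sonnet-4-6", "2025-01-01T00:00:02Z"),
        ("pkg-001", "assess_authorization", "claude-opus-4-7", "2025-01-01T00:00:05Z"),
        ("pkg-001", "map_obligations", "claude-sonnet-4-6", "2025-01-01T00:00:07Z"),
    ],
)


def who_produced(package_id: str, stage: str):
    """Return (model, ts) for the stage call behind a package, or None if absent."""
    return conn.execute(
        "SELECT model, ts FROM stage_log WHERE package_id = ? AND stage = ?",
        (package_id, stage),
    ).fetchone()
```

The composite primary key encodes the claim in the schema: if a stage ran twice for the same package, the second insert would fail loudly instead of leaving two candidate answers to the auditor's question.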
A graph executor with hidden retries can produce the same artefact, on a good day. On a bad day, the executor retried four times under load, your token bill quadrupled silently, and the response was the latest of the four runs. The artefact looks identical. The integrity story is not.
The rule, written down inside the engineering team in one line, is "architecture serves the artefact". Every decision in the pipeline is justified against that rule. The four functions exist because four named stages map cleanly to how a regulator reasons about an AI system. The two models exist because judgement and structured output are different problems with different cost curves. The Pydantic types exist because the artefact citations have to be checked at edit time, not at the auditor's desk. The orchestrator is twelve lines because every line in it is one the auditor can read.
Nothing else is in the system. There is no graph object, no executor, no callback handler, no state machine, no retry decorator with a hidden policy. Four functions, two models, one pipeline. The architecture decision is the citation discipline.
Read the source directly.
- EU AI Act, Article 12, automatic recording of events over the system's lifetime
- Anthropic API reference, the surface the pipeline calls directly
- Pydantic, typed boundaries between stages
- Chris Richardson, on bounded contexts and pattern selection
- Martin Fowler, on architecture as a series of boundary decisions
Drop a trace. Read the four log lines.
The fastest way to read the architecture is to run a sample trace through the live demo. Four log lines, one per stage, model id and token cost on each. The pipeline you have just read about, in production, on a real trace.