ARCHITECTURE · PILLAR · FUNCTIONS OVER GRAPHS
PUBLISHED 2026-05-08 · ~10-MIN READ · WARRANT ENGINEERING

no agentic framework. just functions.

warrant runs zero LangGraph, zero AutoGen, zero CrewAI. four functions, two models, one pipeline. the architectural choice that lets us cite sub-clauses with discipline.

PIPELINE
4 functions · Classify · Extract · Assess · Map
Four named, typed, sequential function calls. Single-agent-output. Bounded depth.
DEPS
zero LangGraph · zero AutoGen · zero CrewAI
Native Anthropic API + tool use. The pipeline is the orchestrator.
ORCHESTRATOR
12 lines · Python
One def run(trace) -> EvidencePackage. Typed boundaries. The compiler is the control-flow graph.
PIPELINE · 4 FUNCTIONS · 2 MODELS · 9 REGIMES
Opus 4.7 + Sonnet 4.6 over the Anthropic SDK · Pydantic at every boundary · the production architecture as of 2026-05-08.

The compliance-for-AI category is full of products built on agent orchestration libraries. LangGraph for the graph. AutoGen for the multi-agent chat. CrewAI for the role-based crew. Semantic Kernel for the planner. Each library promises composition. Each library, in our experience, delivers coupling. Three things you do not need leak in alongside the part you wanted: hidden control flow, hidden retry semantics, and hidden token spend.

Warrant's pipeline is four named Python functions calling the Anthropic API. Each function takes a typed input, returns a typed output, and writes one structured log line per call. A junior engineer reads the orchestrator in twelve lines and understands the system. The auditor's question, "which exact stage produced this citation?", has a one-line answer in the trace.

This post is the architectural argument for that choice. Not a takedown of any specific library. A defence of explicit functions and typed edges, in the bounded-depth setting where evidence is the deliverable.

01 · ACCIDENTAL COMPLEXITY

The library as accidental complexity.

Three-stage agentic pipeline · LangGraph version vs plain Python · same output, different surface area

Take a three-stage pipeline. Classify a trace, extract actions from it, assess each action against a policy. The same task, written two ways. The graph version below is roughly forty lines. The plain version is under thirty. Both produce identical output on a fixed trace.

python · graph executor version
from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic
from typing import TypedDict, List

class State(TypedDict):
    trace: dict
    classification: dict
    actions: List[dict]
    findings: List[dict]

llm = ChatAnthropic(model="claude-opus-4-7")
fast_llm = ChatAnthropic(model="claude-sonnet-4-6")

def classify_node(state):
    msg = llm.invoke([{"role": "user", "content": f"classify: {state['trace']}"}])
    return {"classification": msg.content}

def extract_node(state):
    msg = fast_llm.invoke([{"role": "user", "content": f"extract actions from: {state['trace']}"}])
    return {"actions": msg.content}

def assess_node(state):
    msg = llm.invoke([{"role": "user", "content": f"assess: {state['actions']}"}])
    return {"findings": msg.content}

graph = StateGraph(State)
graph.add_node("classify", classify_node)
graph.add_node("extract", extract_node)
graph.add_node("assess", assess_node)
graph.set_entry_point("classify")
graph.add_edge("classify", "extract")
graph.add_edge("extract", "assess")
graph.add_edge("assess", END)

app = graph.compile()
result = app.invoke({"trace": trace})
# where did the retry happen? what was the wall-clock per node?
# the executor knows. you have to read its source to know.

The graph compiles. The output is correct. When extract returns malformed JSON, what happens next depends on whichever retry policy is attached to the node and on the model wrapper's own defaults, neither of which is visible at the call site. When assess hits a rate limit, the same layers decide what to do. When you ask "how many tokens did this trace cost?", the answer lives behind a callback handler you have to register.

python · plain functions version
from anthropic import Anthropic

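# the SDK retries some failed requests twice by default; switch that off
# so that every retry really is yours.
client = Anthropic(max_retries=0)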

def classify(trace: dict) -> dict:
    r = client.messages.create(
        model="claude-opus-4-7", max_tokens=2048,
        messages=[{"role": "user", "content": f"classify: {trace}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}

def extract(trace: dict) -> dict:
    r = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=2048,
        messages=[{"role": "user", "content": f"extract actions: {trace}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}

def assess(actions: dict) -> dict:
    r = client.messages.create(
        model="claude-opus-4-7", max_tokens=2048,
        messages=[{"role": "user", "content": f"assess: {actions}"}])
    return {"text": r.content[0].text, "usage": r.usage.model_dump()}

classification = classify(trace)
actions = extract(trace)
findings = assess(actions)
# every retry is yours. every token count is in the return value.
# the call site is the control flow.

Same output. Fewer lines. The retry semantics are explicit: the only built-in retry is the Anthropic SDK's documented client default, which max_retries on the client sets or disables, and beyond that there are none yet. When you decide to add them, you wrap one function call in a four-line retry loop with the policy you actually want. Token spend is in the return value, not in a callback. The control flow is the call site. The auditor question, "which model produced which output?", has a one-line answer.
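A minimal sketch of the retry loop mentioned above, under stated assumptions: call_with_retry, the attempt count, and the backoff values are hypothetical and not Warrant's production policy. The flaky function stands in for a single stage call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay_s: float = 0.5) -> T:
    """Call fn until it succeeds or the attempts run out (hypothetical policy)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:  # a real policy would catch only retryable errors
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise ValueError("attempts must be at least 1")


# Stand-in for a stage call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_stage() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"text": "ok", "attempt": calls["n"]}


result = call_with_retry(flaky_stage, attempts=3, base_delay_s=0.0)
```

At a call site this would read something like findings = call_with_retry(lambda: assess(actions)). The policy sits next to the call it governs, which is the property the plain version is arguing for.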

This is what accidental complexity means. The graph executor solves the general problem of arbitrary graph traversal, and the price of that generality is paid in every node, every run.
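To make "token spend is in the return value" concrete, here is a small sketch that sums usage across stage results shaped like the dicts returned by the plain functions. The helper name total_usage is hypothetical, and the token numbers are illustrative, not measurements.

```python
def total_usage(*stage_results: dict) -> dict:
    """Sum input and output tokens across stage return values."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for result in stage_results:
        usage = result["usage"]
        for key in totals:
            totals[key] += usage.get(key) or 0  # treat missing or None as zero
    return totals


# Illustrative numbers only, shaped like {"text": ..., "usage": ...} above.
classification = {"text": "...", "usage": {"input_tokens": 1200, "output_tokens": 150}}
actions = {"text": "...", "usage": {"input_tokens": 1200, "output_tokens": 400}}
findings = {"text": "...", "usage": {"input_tokens": 500, "output_tokens": 300}}

spend = total_usage(classification, actions, findings)
```

No callback is registered anywhere; the accounting is an ordinary function over ordinary return values.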

02 · THE FOUR FUNCTIONS

The four functions that ship the artefact.

api/pipelines/*.py · the production analysis path · zero hidden state, zero hidden retries

The Warrant analysis path is exactly four functions. Each takes a typed input, returns a typed output, writes one structured log line per call. The orchestrator is twelve lines.

01 · CLASSIFY
classify_trace
trace → ClassificationResult. Domain, jurisdictions, regimes, risk tier. Opus 4.7.
02 · EXTRACT
extract_actions
trace, classification → List[Action]. Tool-use schema, deterministic JSON. Sonnet 4.6.
03 · ASSESS
assess_authorization
actions → List[AuthorizationFinding]. Within purpose, preconditions met, oversight, reversibility. Opus 4.7.
04 · MAP
map_obligations
actions, findings → EvidencePackage. Sub-clause citations across nine frameworks, six jurisdictions. Sonnet 4.6.

The signatures are written down in one Pydantic file. The type checkers, mypy and pyright, both read it. Misuse is caught at edit time, not at the regulator's desk.

python · types.py
from pydantic import BaseModel, Field
from typing import Literal, List
from datetime import datetime

class ClassificationResult(BaseModel):
    domain: Literal["lending", "advisory", "kyc", "trading"]
    jurisdictions: List[Literal["EU", "IN", "US", "UK", "SG", "AU"]]
    regimes: List[str]
    risk_tier: Literal["low", "limited", "high", "unacceptable"]
    confidence: float = Field(ge=0.0, le=1.0)

class Action(BaseModel):
    actor: str
    action: str
    subject: str
    inputs: dict
    outputs: dict
    ts: datetime

class AuthorizationFinding(BaseModel):
    action_index: int
    within_purpose: bool
    preconditions_met: bool
    human_oversight_appropriate: bool
    reversible: bool
    justification: str

class EvidencePackage(BaseModel):
    classification: ClassificationResult
    actions: List[Action]
    findings: List[AuthorizationFinding]
    obligations: List["ObligationMapping"]  # forward reference; ObligationMapping is not shown in this excerpt
    package_id: str
    ts: datetime

The right word for these functions is pure-ish. The Anthropic API call is a side effect, and so is the structured log line. Both are explicit and visible. No hidden state machine, no callback handler buried in a base class. The function does what it says.
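One way to get exactly one structured log line per call is a small decorator. This is a sketch under assumptions, not the production logger: logged_stage, the field names, and the in-memory LOG_LINES sink are all hypothetical, and the stage body is a placeholder.

```python
import functools
import json
import time
from typing import Any, Callable

LOG_LINES: list[str] = []  # stand-in sink; a real system would write to a log stream


def logged_stage(stage: str, model: str) -> Callable:
    """Wrap a stage so each call emits exactly one JSON log line."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                LOG_LINES.append(json.dumps({
                    "event": "stage.call",
                    "stage": stage,
                    "model": model,
                    "status": status,
                    "wall_ms": round((time.perf_counter() - start) * 1000, 3),
                }))
        return wrapper
    return decorator


@logged_stage(stage="classify_trace", model="claude-opus-4-7")
def classify_trace(trace: dict) -> dict:
    # Placeholder body: the real stage calls the model API.
    return {"domain": "lending"}


out = classify_trace({"id": "t-1"})
record = json.loads(LOG_LINES[-1])
```

The side effect stays visible: the decorator is applied at the definition of each stage, and the log line names the stage and the model itself.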

python · the orchestrator, in full
def run(trace: dict) -> EvidencePackage:
    classification = classify_trace(trace)
    actions = extract_actions(trace, classification)
    findings = assess_authorization(actions)
    package = map_obligations(actions, findings)
    pdf_bytes = render_pdf(package)
    record = finalize_record(pdf_bytes, package.package_id)
    log("pipeline.complete", package_id=package.package_id,
        wall_ms=elapsed_ms(),
        cost_usd=total_cost(classification, actions, findings, package))
    return record

Twelve lines. Nothing hidden. The auditor question, "show me the exact code path that produced this citation", resolves to one file, four function names, one ordered list. The European framing in Article 12(1) of the AI Act, on automatic recording of events over the lifetime of the system, leaves no room for an answer like "well, the executor decided". Every stage logs. Every stage names itself.

"High-risk AI systems shall technically allow for the automatic recording of events (logs) over the lifetime of the system." EU AI Act · Article 12(1)

The word lifetime is the part that dictates the architecture. A system whose internal control flow is opaque cannot guarantee that every event is recorded. A system whose control flow is the call site can.

03 · TWO MODELS

Why two models, not one.

Opus 4.7 for judgement · Sonnet 4.6 for structured output · the cost-quality split

The pipeline routes by stage character. Classify and Assess require judgement. Distinguishing an informational action from an advisory one inside an advisory context is the kind of distinction Opus 4.7 makes consistently and Sonnet 4.6 sometimes does not. Deciding whether a specific action was within purpose, with its preconditions met and oversight appropriate, is the same shape of problem: dense legal text, multi-hop inference, narrow correctness. Opus carries those.

Extract and Map are structured-output stages. Tool-use schema in, deterministic JSON out. Map matches each action to a sub-clause with retrieval, across nine frameworks and six jurisdictions, so what is on test there is structured-output reliability, not legal reasoning. Sonnet 4.6 is the right model for that work: faster and cheaper.
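"Tool-use schema in, deterministic JSON out" can be sketched without a network call. The tool definition below follows the shape the Anthropic Messages API accepts for its tools parameter (name, description, input_schema), but the tool name record_actions, the trimmed ActionStub type, and the sample block are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tool definition: name, description, and a JSON Schema for the input.
RECORD_ACTIONS_TOOL = {
    "name": "record_actions",
    "description": "Record every action found in the trace.",
    "input_schema": {
        "type": "object",
        "properties": {
            "actions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "actor": {"type": "string"},
                        "action": {"type": "string"},
                        "subject": {"type": "string"},
                    },
                    "required": ["actor", "action", "subject"],
                },
            }
        },
        "required": ["actions"],
    },
}
# Forcing the tool means the reply carries a tool_use block rather than free text.
TOOL_CHOICE = {"type": "tool", "name": "record_actions"}


@dataclass(frozen=True)
class ActionStub:
    """Trimmed stand-in for the Action model (three of its six fields)."""
    actor: str
    action: str
    subject: str


def parse_actions(content_blocks: list[dict]) -> list[ActionStub]:
    """Pull the forced tool call out of a response and type its input."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == "record_actions":
            return [ActionStub(**item) for item in block["input"]["actions"]]
    raise ValueError("response contained no record_actions tool call")


# A hand-written dict shaped like a tool_use content block, for illustration.
sample = [{
    "type": "tool_use", "id": "toolu_example", "name": "record_actions",
    "input": {"actions": [{"actor": "agent", "action": "quote_rate", "subject": "applicant-17"}]},
}]
parsed = parse_actions(sample)
```

The SDK returns content blocks as objects rather than dicts, so a real stage would read block.type and block.input. In the pipeline described here the typing step would be a Pydantic model rather than a dataclass.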

| Stage | Model | Why | Per-trace cost |
| --- | --- | --- | --- |
| 01 · classify_trace | Opus 4.7 | Judgement, multi-jurisdiction routing | ~$0.18 |
| 02 · extract_actions | Sonnet 4.6 | Structured tool-use schema, JSON | ~$0.04 |
| 03 · assess_authorization | Opus 4.7 | Per-action authorisation reasoning | ~$0.32 |
| 04 · map_obligations | Sonnet 4.6 | Sub-clause matching with retrieval | ~$0.06 |

The split lands at roughly sixty cents an analysis on the average production trace. An all-Opus pipeline runs near a dollar fifty for the same output, with marginal accuracy gain on the structured stages. An all-Sonnet pipeline runs under twenty cents and loses meaningfully on the judgement stages. Two models, two roles, the right tool in each hand.
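The sixty-cent figure is the sum of the four per-stage costs. A short check of that arithmetic, using only the per-stage numbers quoted in this section (the variable names are illustrative):

```python
# Per-trace costs in USD, copied from the per-stage figures in this section.
STAGE_COST_USD = {
    "classify_trace": 0.18,
    "extract_actions": 0.04,
    "assess_authorization": 0.32,
    "map_obligations": 0.06,
}
JUDGEMENT_STAGES = {"classify_trace", "assess_authorization"}  # the Opus stages

total = round(sum(STAGE_COST_USD.values()), 2)
judgement = round(
    sum(cost for stage, cost in STAGE_COST_USD.items() if stage in JUDGEMENT_STAGES), 2
)
judgement_share = judgement / total  # fraction of spend on the judgement stages
```

The total comes to $0.60, of which $0.50 sits in the two judgement stages. The cost curve is dominated by where the reasoning happens, not by the number of calls.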

The routing is not a feature flag. It is a constant per stage. MODEL_JUDGEMENT = "claude-opus-4-7", MODEL_STRUCTURED = "claude-sonnet-4-6". When Anthropic ships a new Sonnet, one string changes. The eval suite reruns. The decision is data, not engineering.
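A sketch of what such a constants module could look like. The two constant names and model ids come from the text; the STAGE_MODEL mapping and the model_for helper are hypothetical additions that spell out the per-stage routing.

```python
# constants.py (sketch)
MODEL_JUDGEMENT = "claude-opus-4-7"
MODEL_STRUCTURED = "claude-sonnet-4-6"

# Hypothetical mapping: one fixed model per stage, no run-time switching.
STAGE_MODEL = {
    "classify_trace": MODEL_JUDGEMENT,
    "extract_actions": MODEL_STRUCTURED,
    "assess_authorization": MODEL_JUDGEMENT,
    "map_obligations": MODEL_STRUCTURED,
}


def model_for(stage: str) -> str:
    """Return the model id for a stage; an unknown stage raises KeyError."""
    return STAGE_MODEL[stage]
```

Changing a model is then a one-line diff in this file, and the eval suite is the only thing that has to rerun.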

04 · COMPOSITION VS COUPLING

Libraries promise composition. They deliver coupling.

The model upgrade test · two changelogs versus one string · the maintenance asymmetry

Every orchestration library promises composition. Drop in a new node, swap a model, change a tool, the graph stays intact. The reality of running these libraries through model upgrades, in production, is that the surface they expose is wider than the surface you wanted, and the wider surface couples you to their release cadence.

When Anthropic shipped Sonnet 4.6, the upgrade in the four-function pipeline was one constant. MODEL_STRUCTURED = "claude-sonnet-4-6". Run the eval suite. Diff per-stage. Decide. Total elapsed time: the eight minutes the eval takes, plus the human read of the diff.

When the library upgrades, you read its CHANGELOG. Release notes for the executor, schema migration for the state object, deprecation list for tool decorators, breaking changes for callbacks. Maybe you rewrite three nodes because the state interface changed. Maybe a retry default flipped, your bill silently doubles for a week, and you find out from the cost dashboard. The model upgrade and the library upgrade are independent axes of breakage. You signed up for both.

This is what Eric Evans means by boundaries. Architecture is a series of decisions about which boundaries to draw, where, and what passes across them. The boundary between the analysis pipeline and the model API has one decision in it, the model id. The boundary between the analysis pipeline and an orchestration library has hundreds of decisions in it, embedded in the library's API surface, most of which were not made by you.

DO

Couple to the model API directly. Couple to your own type definitions.

Two surfaces. Both stable. Vendor controls one, you control the other. Upgrading either is a single-axis decision.

DON'T

Couple to a graph executor that wraps the model API.

Three surfaces. One yours, two vendor. The vendor library tracks model changes on its own schedule, and you absorb every drift event downstream of both.

05 · TYPED EDGES

The compiler is your control-flow graph.

Pydantic at every boundary · mypy and pyright as the visualiser

An orchestration library shows you a graph in a notebook. Nodes connected by arrows, the visualisation as the documentation. The picture is appealing and the run-time semantics are not always what it says.

Pydantic models at every boundary make the type checker the graph. classify_trace takes dict and returns ClassificationResult. extract_actions takes dict and ClassificationResult, returns List[Action]. The shape of the pipeline is the signatures, checked on every save and every push. Wire the output of extract_actions into a position that expects List[AuthorizationFinding], and mypy catches it before the function runs.

This property does not show up on a feature checklist. It shows up the first time a junior engineer changes a stage signature and the editor lights up red across nine call sites. The pipeline tells you, at edit time, what it expects.

python · the type-checked seam
def map_obligations(
    actions: List[Action],
    findings: List[AuthorizationFinding],
) -> EvidencePackage:
    # actions and findings have stable shapes. the function body
    # can index, iterate, and select fields without runtime guards.
    # pyright catches a wrong shape at the call site, not at midnight.
    ...

# wrong wiring, caught by pyright before the function runs:
package = map_obligations(findings, actions)
# error: argument of type "list[AuthorizationFinding]" is not
# assignable to parameter "actions" of type "list[Action]"

The deeper claim is that the compiler is the right place to put a control-flow graph. A control-flow graph that lives in a library's runtime has to be re-checked every run. A control-flow graph that lives in the type system is checked once, in the editor, and stays checked.

06 · WHEN A FRAMEWORK HELPS

Where libraries earn their complexity.

The honest trade-off · open-ended planning, large tool ecosystems, human-in-loop interventions

The argument is not that orchestration libraries are wrong. They are right for a different problem.

Three properties make a library's overhead pay off. Unbounded planning depth, where the agent decides at run time how many steps to take and which tools to call. Large tool ecosystems, where the integration cost of fifty tools dwarfs the cost of the orchestration layer. Many human-in-loop interventions, where pausing the graph, presenting state to a human, and resuming on input is the dominant runtime concern.

FRAMEWORK FITS

Open-ended research agents. Unbounded planning depth, tool selection at run time, dynamic graph shape. The graph executor earns its complexity.

Tool-rich operator agents. Fifty plus tools, the integration manifest is the work, the orchestration overhead is small in comparison.

Human-in-loop workflows. The pause-and-resume pattern, with state inspection and manual approval, is the right thing to factor out.

FRAMEWORK MISFIT

Bounded depth. Warrant is always four stages. The depth is in the prompt and the type, not in the run-time graph.

Bounded tools. Anthropic API for the model. Postgres for retrieval. Two surfaces, neither of which benefits from a tool registry.

Zero human interventions per trace. A record mapped to a specific EU AI Act obligation is the deliverable. There is no human approval gate inside the run.

The shape of the problem decides the shape of the architecture. An open-ended research agent and a four-stage evidence pipeline are not the same product, and they should not share the same scaffolding. The library earns its keep where the run-time graph is the thing being modelled. It is overhead where the run-time graph is fixed, the tools are two, and the deliverable is a single record that is independently verifiable without contacting Warrant.

07 · EVIDENCE OVER ARCHITECTURE

The regulator does not care about your DAG.

Architecture serves the artefact · the artefact is what gets cited in court · the rest is implementation detail

An auditor reading a Warrant evidence package never sees the four-function pipeline. They see a record mapped to a specific EU AI Act obligation, with a sub-clause citation, independently verifiable without contacting Warrant. The architecture is invisible. That is the point.

The reason the architecture has to be small, named, and explicit is that the artefact has to be defensible. Show me the citation, show me the action that triggered it, show me the model that produced the assessment, show me the timestamp. Each question resolves to one row in one log table. The architecture choice is the precondition for that resolvability.
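"One row in one log table" can be illustrated with an in-memory SQLite stand-in. Everything here is an assumption made for the sketch: the stage_log table, its columns, the who_produced helper, and the sample rows. Postgres is the only store this post names, and it names it for retrieval.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")  # stand-in store; names below are hypothetical
conn.execute("""
    CREATE TABLE stage_log (
        package_id TEXT, stage TEXT, model TEXT,
        action_index INTEGER, citation TEXT, ts TEXT
    )
""")
# Illustrative rows, not real log data.
conn.executemany(
    "INSERT INTO stage_log VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("pkg-001", "assess_authorization", "claude-opus-4-7", 0, None,
         "2026-05-08T10:00:01Z"),
        ("pkg-001", "map_obligations", "claude-sonnet-4-6", 0, "EU AI Act Art. 12(1)",
         "2026-05-08T10:00:03Z"),
    ],
)


def who_produced(citation: str, package_id: str) -> Optional[tuple]:
    """Return (stage, model, action_index, ts) for one citation in one package."""
    return conn.execute(
        "SELECT stage, model, action_index, ts FROM stage_log "
        "WHERE package_id = ? AND citation = ?",
        (package_id, citation),
    ).fetchone()


row = who_produced("EU AI Act Art. 12(1)", "pkg-001")
```

The query is only this simple because every stage names itself and its model when it logs. With retries hidden inside an executor, the same question would need a join against whatever the executor chose to record.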

A graph executor with hidden retries can produce the same artefact, on a good day. On a bad day, the executor retried four times under load, your token bill quadrupled silently, and the response was the latest of the four runs. The artefact looks identical. The integrity story is not.

The rule, written down inside the engineering team in one line, is architecture serves the artefact. Every decision in the pipeline is justified against that rule. The four functions exist because four named stages map cleanly to how a regulator reasons about an AI system. The two models exist because judgement and structured output are different problems with different cost curves. The Pydantic types exist because the artefact citations have to be checked at edit time, not at the auditor's desk. The orchestrator is twelve lines because every line in it is one the auditor can read.

Nothing else is in the system. There is no graph object, no executor, no callback handler, no state machine, no retry decorator with a hidden policy. Four functions, two models, one pipeline. The architecture decision is the citation discipline.

08 · FAQ

Questions an architect asks first.

FAQ · sourced from inbound questions from platform and ML engineering teams during the May 2026 launch week
Do you use any LangChain modules?

No. Zero LangChain, zero LangGraph, zero CrewAI, zero AutoGen, zero semantic-kernel. The pipeline is four Python functions calling the Anthropic SDK directly. Pydantic for typed edges, Postgres for retrieval. That is the entire dependency surface for the analysis path.

What about retrieval? Surely that is a framework?

Retrieval is one Pydantic-typed function, retrieve_clauses(query: str, jurisdiction: Jurisdiction) -> List[Clause]. It embeds the query with one API call, runs a SELECT against a Postgres pgvector index, returns a typed list. About forty lines including the type definitions. There is no orchestration layer, no chain object, no retriever class holding state. A library is not a framework. A function that calls a vector index is not a framework.
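A rough sketch of the shape of such a function, not the production code. To keep it runnable offline, the embedding call and the database call are injected as parameters, which the real signature does not have. The clauses table, its columns, and the Clause fields are assumptions. The one part taken from pgvector itself is the <=> operator, which orders by cosine distance.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Clause:
    """Stand-in for the typed Clause result; field names are assumptions."""
    clause_id: str
    text: str


# Hypothetical table and columns; <=> is pgvector's cosine-distance operator.
NEAREST_CLAUSES_SQL = (
    "SELECT clause_id, text FROM clauses "
    "WHERE jurisdiction = %s "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT %s"
)


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text form, for example [0.1,0.2]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def retrieve_clauses(
    query: str,
    jurisdiction: str,
    embed: Callable[[str], Sequence[float]],
    fetch_rows: Callable[[str, tuple], list],
    limit: int = 5,
) -> list[Clause]:
    """Embed the query, run one SELECT, return a typed list."""
    vector = to_vector_literal(embed(query))
    rows = fetch_rows(NEAREST_CLAUSES_SQL, (jurisdiction, vector, limit))
    return [Clause(clause_id=row[0], text=row[1]) for row in rows]


# Fakes standing in for the embedding API and the Postgres cursor.
captured: dict = {}


def fake_embed(text: str) -> list:
    return [0.1, 0.2]


def fake_fetch(sql: str, params: tuple) -> list:
    captured["sql"], captured["params"] = sql, params
    return [("clause-1", "placeholder clause text")]


clauses = retrieve_clauses("record keeping", "EU", fake_embed, fake_fetch)
```

There is no retriever object and no state between calls. Swapping the fakes for a real embedding client and a database cursor changes the two injected callables and nothing else.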

Does this scale to multi-agent workflows?

Not the goal. Warrant's pipeline is single-agent-output. One trace produces one evidence record, independently verifiable without contacting Warrant. The four stages are not agents talking to each other, they are functions composing into a deterministic transformation. Multi-agent orchestration is a different product category. If you need open-ended planning or N agents debating in a loop, an orchestration layer earns its complexity. The bounded-depth case does not.

How do you test the pipeline end-to-end?

A regression set of two hundred plus labelled traces, each with golden outputs at every stage. Nightly evaluation runs all four stages on every trace, diffs the actual output against the golden, and flags any drift in classification, action extraction, authorisation assessment, or obligation mapping. Per-stage golden outputs mean a regression in stage three does not look like a regression in stage four. The full suite runs in under nine minutes on one workstation.
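The per-stage attribution described above can be sketched as "blame the earliest stage that differs from its golden". The function names and the toy outputs are hypothetical; the real suite's data format is not shown in this post.

```python
from typing import Optional

STAGES = ["classify_trace", "extract_actions", "assess_authorization", "map_obligations"]


def first_drifted_stage(golden: dict, actual: dict) -> Optional[str]:
    """Return the earliest stage whose output differs from its golden, or None."""
    for stage in STAGES:
        if actual.get(stage) != golden.get(stage):
            return stage
    return None


def drift_report(cases: list[dict]) -> dict:
    """Count, per stage, how many traces first drifted at that stage."""
    counts = {stage: 0 for stage in STAGES}
    for case in cases:
        stage = first_drifted_stage(case["golden"], case["actual"])
        if stage is not None:
            counts[stage] += 1
    return counts


# Toy data: one clean trace, one where stage three drifts and drags stage four with it.
golden = {
    "classify_trace": "high",
    "extract_actions": ["a1"],
    "assess_authorization": [True],
    "map_obligations": ["clause-A"],
}
clean = dict(golden)
drifted = dict(golden, assess_authorization=[False], map_obligations=["clause-B"])

report = drift_report([
    {"golden": golden, "actual": clean},
    {"golden": golden, "actual": drifted},
])
```

The drifted trace is counted once, against assess_authorization, even though its map_obligations output also changed. That is the sense in which a stage-three regression does not masquerade as a stage-four one.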

What if Anthropic releases a new model?

The model id changes in one place. There is a constants module with two strings, MODEL_JUDGEMENT and MODEL_STRUCTURED. Update the string, run the eval suite, compare deltas, decide. No library upgrade, no CHANGELOG to read, no graph executor to test for behaviour drift. The eval suite re-runs in eight minutes and produces a per-stage delta table.

Can I use this style with LangChain or the OpenAI SDK?

Yes. The architecture choice is independent of vendor. The point is explicit functions, typed edges, no hidden control flow. You can write a four-function pipeline against the OpenAI SDK, against Bedrock, against the LangChain core primitives, against any HTTP client. The argument is not anti-LangChain. The argument is anti-hidden-graph. If your problem is bounded depth, write functions. If it is not, an orchestration layer may earn its complexity.

09 · READ THE SOURCE

Read the source directly.

SEE THE PIPELINE RUN

Drop a trace. Read the four log lines.

The fastest way to read the architecture is to run a sample trace through the live demo. Four log lines, one per stage, model id and token cost on each. The pipeline you have just read about, in production, on a real trace.