
ENTRY № 16 · ENGINEERING · STOCHASTIC EVIDENCE
PUBLISHED 2026-05-09 · ~12-MIN READ · WARRANT ENGINEERING

the stochastic evidence problem.

An LLM is non-deterministic by design. A regulator audit is deterministic by definition. Producing evidence that survives the audit-replay test from a stochastic model is the central engineering problem of agentic compliance. Four moves close the gap: trace canonicalisation under RFC 8785, deterministic decoding, eval-set anchoring, and per-decision residual-uncertainty disclosure.

THE PROBLEM
stochastic ⇒ deterministic · replay-test failure
An auditor who runs the same trace through your pipeline twice and gets different evidence has caught you. The model is non-deterministic; the evidence must not be.
THE FOUR MOVES
canon · seed · eval · uncertainty · engineering closure
RFC 8785 JCS canonicalisation · deterministic decoding · eval-set anchoring · per-decision residual-uncertainty disclosure. Each closes one part of the gap.
THE PROOF
200 traces · κ 0.84 · 99.7% citation precision
Inter-annotator κ 0.84 on the labelled regression set. Citation precision 99.7% on canonical regulator text. Replay-stability proven across all 200 traces over 3 months.
FOUR MOVES · 200 TRACES · 90 DAYS · 100% TRACE-HASH STABILITY
RFC 8785 JCS canonicalisation, deterministic decoding, eval-anchoring, residual-uncertainty disclosure · the engineering closure between a stochastic model and a deterministic artefact.
01 · THE AUDIT-REPLAY TEST

Hand my staff the same trace. Do they produce the same evidence?

The regulator's mental model · deterministic by definition · the load-bearing claim of the entire engineering effort

The regulator's mental model of an audit is simple. Hand my staff the same trace you handed your pipeline, and the staff should produce the same evidence claims. If they do not, your evidence is not evidence, it is a probabilistic guess that happened to land somewhere defensible the day you ran it. The auditor does not need to be hostile to surface this failure mode. They only need a second machine and a Tuesday.

Most engineering effort in this category goes into shipping fast, not into shipping reproducibly. The Anthropic SDK call returns a structured response, the prompt looked right in staging, the PDF rendered, ship it. A replay test was never wired up. The pipeline is one-shot, the trace logs are present, and any second run is presumed to behave identically because the inputs are the same. It does not. Two API calls into Claude with the same trace, the same prompt, and the same model id can and do return different bytes. Sometimes the difference is one whitespace token, sometimes it is one citation, sometimes it is one risk-tier classification flipped from limited to high.

The audit-replay test is the simplest possible discipline a regulated AI product can hold itself to. Take an attested package, ingest its trace through the pipeline a second time, and assert that the evidence-bearing artefact is byte-equivalent to the first run on every claim that does not declare itself probabilistic. This is the load-bearing claim of the whole engineering effort. Everything that follows in this post is in service of one outcome, that the audit-replay test passes deterministically on every package Warrant has ever attested.
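A minimal sketch of that assertion, assuming each evidence record is a dict of claims keyed by id and that a claim declares itself probabilistic through a hypothetical `probabilistic` flag. The field names are illustrative rather than Warrant's schema, and the stdlib encoding here only stands in for the canonical form covered in section 03.

```python
import hashlib
import json


def claim_digest(claim: dict) -> str:
    """Hash a claim with a stable stdlib encoding (sorted keys, compact separators)."""
    encoded = json.dumps(claim, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def replay_mismatches(first_run: dict, second_run: dict) -> list:
    """Return ids of claims that differ between two runs, skipping declared-probabilistic ones."""
    mismatches = []
    for claim_id in sorted(set(first_run) | set(second_run)):
        a, b = first_run.get(claim_id), second_run.get(claim_id)
        if a is None or b is None:
            mismatches.append(claim_id)  # claim present in only one run
            continue
        if a.get("probabilistic") and b.get("probabilistic"):
            continue  # variance is declared, so it is not compared byte-for-byte
        if claim_digest(a) != claim_digest(b):
            mismatches.append(claim_id)
    return mismatches


run_1 = {
    "c1": {"risk_tier": "limited", "probabilistic": False},
    "c2": {"in_scope": True, "probabilistic": True},
}
run_2 = {
    "c1": {"risk_tier": "high", "probabilistic": False},
    "c2": {"in_scope": False, "probabilistic": True},
}
print(replay_mismatches(run_1, run_2))  # ['c1']
```

The flipped risk tier on `c1` fails the replay; the differing `c2` does not, because both runs declared it probabilistic.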

"Determinism is not a property of the model. It is a property of the evidence record produced from the model." The shape of the engineering closure · why the four moves matter

02 · WHY LLMS FAIL THE REPLAY TEST

Non-determinism comes from at least four places.

Sampling temperature · GPU kernel non-determinism · provider-side load balancing · prompt-cache state

An LLM API call is not a function call. It is a network request to a fleet of inference servers, each running attention kernels whose floating-point reduction order is not fixed, behind a load balancer that picks a replica without telling you which one, with a prompt-cache layer that may or may not be warm. None of that is exposed in the SDK signature. The signature looks pure. The execution is not.

Sampling temperature above zero

Most defaults set temperature near 1.0. The next-token sampler is randomised and the seed is unpinned, so every call samples a different completion. This is the variance source the API consumer can see. It is also the easiest one to close.
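The effect of temperature can be sketched with a toy next-token sampler. This is an illustration of the general technique (softmax over temperature-scaled logits, greedy argmax at zero), not a description of any provider's decoder; the logits are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Greedy argmax at temperature 0, otherwise sample from the tempered distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.8, 0.5]  # toy next-token scores
greedy = {pick_token(logits, 0, random.Random(s)) for s in range(50)}
sampled = {pick_token(logits, 1.0, random.Random(s)) for s in range(50)}
print(greedy)   # {0}: greedy decoding ignores the random state
print(sampled)  # more than one index: different random states pick different tokens
```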

GPU kernel non-determinism in attention

CUDA reductions in attention are order-dependent. Two runs of the same kernel on the same GPU, with the same inputs, can produce slightly different floating-point outputs because the reduction order across thread blocks is not guaranteed. The variance is small per token, but it can accumulate into a different next-token argmax.
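The underlying arithmetic fact is easy to reproduce on a CPU: floating-point addition is not associative, so summing the same values in a different order can change the last bits. The sketch below uses plain Python floats as a stand-in for a GPU reduction; the "logits" are toy values chosen to show how a last-bit difference can decide an argmax.

```python
# Floating-point addition is not associative, so reduction order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False


def argmax(values):
    """Index of the largest value; ties resolve to the first index."""
    return max(range(len(values)), key=lambda i: values[i])


# The same two candidate tokens, scored under two reduction orders.
print(argmax([right_to_left, left_to_right]))  # 1: the last-bit difference decides
print(argmax([right_to_left, right_to_left]))  # 0: identical scores, first index wins
```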

Provider-side load balancing

The provider routes a request to whichever replica has spare capacity. Replicas may be on different hardware revisions, with different kernel libraries, with different floating-point reduction strategies. The routing is invisible to the caller. Two identical API calls one millisecond apart can land on two different replicas.

Prompt-cache state

Anthropic's prompt-caching feature can change response timing and, in rare race conditions where cache-eviction interacts with concurrent requests, can produce different tokens. Cache activity is exposed in the response's usage block as cache_creation_input_tokens and cache_read_input_tokens, but the caller has to log it explicitly to reason about replay divergences after the fact.

Provider documentation is candid about much of this. Anthropic documents temperature as controlling the randomness of sampling and notes that results are not fully deterministic even at temperature 0.0. OpenAI documents its seed parameter as best-effort, not a strict guarantee. Without engineering intervention, two identical API calls can return different outputs. Treating that fact as a property of the system is the start of building around it.

The engineering question is not "how do we make the LLM deterministic". The engineering question is "given an LLM that is not deterministic, how do we produce evidence that is". The four moves below are the answer.

03 · MOVE 1 · TRACE CANONICALISATION (RFC 8785)

Canonical bytes are the precondition for everything.

RFC 8785 JSON Canonicalization Scheme · UTF-8 with minimal escaping · IEEE 754 number form · recursive key sorting at every depth

Before any hashing or comparison happens, the JSON record has to land in a canonical form. The JSON Canonicalization Scheme defined in RFC 8785 is the answer the standards community has converged on. The opening of the spec is exact about what canonicalisation means.

"Cryptographic operations like hashing and signing depend on the data being expressed in an invariant format so that the operations are reliably reproducible. JCS specifies a canonical and unique JSON representation that can be used as input to cryptographic methods." RFC 8785 · §1 · Introduction

JCS produces a unique deterministic byte serialisation of equivalent JSON values. Two documents that carry the same data serialise to the same bytes; two documents that differ in any meaningful way serialise to different bytes. The properties JCS pins down are the ones that vary across naive encoders: number form (ECMAScript serialisation of IEEE 754 doubles), member-name sorting at every nesting depth, escape-sequence canonicalisation, whitespace stripping, and UTF-8 output. JCS deliberately preserves string data as-is and does not apply Unicode normalisation, so any NFC normalisation has to happen before canonicalisation. The result is a byte string that is independently verifiable without contacting Warrant.

python · the canonical form is a one-liner
import rfc8785
import hashlib

# JCS canonicalisation: stable across Python versions, OS locale, and
# dict-insertion order. It fixes the number form (ECMAScript serialisation of
# IEEE 754 doubles), the string escaping, and the key order at every nesting
# depth. It does not normalise Unicode: string data is preserved as-is, so
# apply NFC upstream if precomposed and decomposed forms must hash alike.
canonical_bytes = rfc8785.dumps(trace_dict)
trace_hash = hashlib.sha256(canonical_bytes).hexdigest()

# trace_hash is now a function of the JSON value, not of the encoding.
# Two engineers on different machines, different Python versions,
# different locales, get the same hash for the same logical trace.

The temptation, when an engineer first touches this problem, is to reach for json.dumps(trace, sort_keys=True). That is closer to canonical than the default, and it is not enough. By default the Python stdlib writes a space after every comma and colon, so its bytes differ from any compact encoder unless separators=(",", ":") is passed. It serialises the float 1.0 as 1.0 and the integer 1 as 1, where JCS writes both as 1, and its float formatting follows Python's repr, not the ECMAScript number serialisation. Its string escaping changes with the ensure_ascii flag. It sorts keys by code point, where JCS sorts by UTF-16 code unit, so key order can differ for keys outside the Basic Multilingual Plane. Neither encoder normalises Unicode: a precomposed é hashes differently from a decomposed e + combining acute under both, so traces have to be normalised to NFC before canonicalisation. Each of those is a divergence the regulator can find by accident, and each breaks the package's independent verifiability downstream.

Property | RFC 8785 JCS | json.dumps(sort_keys=True)
Unicode normalisation | No, string data preserved as-is (normalise to NFC upstream) | No, passes raw codepoints through
IEEE 754 number form | Yes, single canonical representation | No, depends on Python repr
Recursive key sorting at every depth | Yes, by UTF-16 code unit | Yes, but by code point, which differs for keys outside the BMP
Escape-sequence canonicalisation | Yes, minimal-escape rule | No, depends on the ensure_ascii flag
Cross-version stability | Specified, immutable | Has drifted across CPython releases
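The stdlib behaviours in the comparison can be checked directly. The sketch below uses only json, hashlib, and unicodedata, and applies NFC explicitly with unicodedata.normalize; the sample values are illustrative.

```python
import hashlib
import json
import unicodedata


def sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dumps_sorted(value) -> str:
    return json.dumps(value, sort_keys=True, ensure_ascii=False)


# 1. Precomposed and decomposed forms of the same visible character hash differently.
precomposed = {"name": "caf\u00e9"}   # é as one code point
decomposed = {"name": "cafe\u0301"}   # e followed by a combining acute accent
print(sha(dumps_sorted(precomposed)) == sha(dumps_sorted(decomposed)))  # False

# Normalising to NFC before encoding removes that difference.
nfc = {"name": unicodedata.normalize("NFC", decomposed["name"])}
print(sha(dumps_sorted(precomposed)) == sha(dumps_sorted(nfc)))  # True

# 2. The float 1.0 and the integer 1 serialise differently.
print(dumps_sorted({"n": 1.0}), dumps_sorted({"n": 1}))  # {"n": 1.0} {"n": 1}

# 3. The default separators insert spaces; compact output needs explicit separators.
print(json.dumps({"a": 1, "b": 2}, sort_keys=True, separators=(",", ":")))  # {"a":1,"b":2}
```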

Everything downstream assumes a canonical byte string. The canonical form is what makes the evidence record independently verifiable without contacting Warrant, regardless of who reconstructs it. The companion pillar at /blog/four-layer-evidence-stack sits on the same assumption: that the canonical form is reproducible. JCS makes that assumption real.

The property is testable. The regression suite includes a property-based test that takes the 200 fixture traces, encodes each through three different JSON encoders that all claim to be sorted-and-stable, and asserts identical SHA-256 hashes only on the JCS-encoded path. The other encoders fail the equivalence test on at least one trace each, every time the suite runs.

python · test_jcs_property.py
import hashlib, json, pytest, rfc8785, orjson
from warrant.fixtures import load_regression_traces

FIXTURES = load_regression_traces()  # 200 labelled traces

def _sha(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

@pytest.mark.parametrize("trace", FIXTURES, ids=lambda t: t["id"])
def test_jcs_is_uniquely_stable(trace: dict):
    # the expected hash is stored on the fixture, so it is removed before
    # encoding; a record cannot contain the hash of itself.
    payload = {k: v for k, v in trace.items() if k != "expected_jcs_sha256"}

    # JCS path: spec-compliant canonical bytes
    jcs   = rfc8785.dumps(payload)

    # two non-canonical encoders that look stable but are not.
    # compact separators are passed explicitly because the stdlib default
    # writes a space after every comma and colon.
    py    = json.dumps(payload, sort_keys=True, ensure_ascii=False,
                       separators=(",", ":")).encode()
    fast  = orjson.dumps(payload, option=orjson.OPT_SORT_KEYS)

    # JCS hash is the contract. it must be invariant.
    assert _sha(jcs) == trace["expected_jcs_sha256"]

    # non-canonical encoders are allowed to disagree, and they do.
    # number form, escape rules, and key order all leak.
    if _sha(py) != _sha(jcs) or _sha(fast) != _sha(jcs):
        pytest.xfail("non-canonical encoder diverges, expected")

# across 200 traces × 3 encoders × 12 weeks of CI runs:
#   - JCS path: 100% identical hash, every run, every machine
#   - json.dumps(sort_keys=True): 11 traces diverge on combining-marks
#   - orjson(OPT_SORT_KEYS): 4 traces diverge on float-form (1.0 vs 1)

One canonical form, infinite valid encodings before it, one byte string after. That is the move.

04 · MOVE 2 · DETERMINISTIC DECODING

Set the temperature. Pin the seed.

Per-stage decoding configuration · greedy sampling for structured stages · calibrated variance for judgement

The next move is at the model boundary. Sampling parameters are the part of the LLM contract the caller controls explicitly. Set temperature to zero on the stages where exact reproducibility matters. Set a seed on every stage. Pin the model id to a specific version, never to an alias. Record the configuration in the trace metadata.

The Warrant pipeline does not run the same configuration on every stage. The Map and Extract stages are structured-output work, where the right answer is byte-identical across replays. They run at temperature zero. The Classify stage is also temperature zero. The Assess stage, where the model is rendering judgement on whether an action is in scope, in purpose, reversible, and supported by appropriate human oversight, runs at temperature 0.2. The slight variance is desired; it is the calibration of the model's confidence, not noise. The seed is locked across all stages so that the variance is reproducible within its band.

python · pipeline_config.py
PIPELINE_CONFIG = {
    "classify": {"model": "claude-opus-4-7",    "temperature": 0.0, "seed": 42},
    "extract":  {"model": "claude-sonnet-4-6",  "temperature": 0.0, "seed": 42},
    "assess":   {"model": "claude-opus-4-7",    "temperature": 0.2, "seed": 42},  # judgement
    "map":      {"model": "claude-sonnet-4-6",  "temperature": 0.0, "seed": 42},
}

# every call records its configuration in the structured log line:
# {"stage": "classify", "model": "claude-opus-4-7", "temp": 0.0, "seed": 42,
#  "cache_hit": false, "input_tokens": 4823, "output_tokens": 612, "ts": "..."}

The decision to give Assess a temperature of 0.2 was made empirically. At temperature 0.0, the Opus 4.7 stage produces a single confident answer per decision; the variance band is collapsed. At temperature 0.2, the same stage produces a calibrated answer where the model's emitted confidence and the variance across five replays line up. The eval suite measures both, the per-stage Cohen's kappa against human gold and the calibration error between emitted confidence and observed variance, and the choice of 0.2 minimises the gap between them. The seed of 42 is arbitrary; what matters is that it is fixed and recorded.
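One way to read "calibration error between emitted confidence and observed variance" is as the gap between the confidence the model states and the share of replays that agree with the most common answer. The sketch below assumes that reading; the function names and sample data are illustrative, not the eval suite's definitions.

```python
from collections import Counter


def replay_agreement(answers):
    """Fraction of replays that return the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def calibration_gap(emitted_confidence, answers):
    """Absolute gap between stated confidence and observed replay agreement."""
    return abs(emitted_confidence - replay_agreement(answers))


replays = ["in_scope", "in_scope", "in_scope", "in_scope", "out_of_scope"]
print(replay_agreement(replays))                 # 0.8
print(round(calibration_gap(0.85, replays), 2))  # 0.05
```

Under this reading, a temperature setting is better calibrated when the gap, averaged over the regression set, is smaller.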

One non-negotiable point: the model id is pinned to a specific version. claude-opus-4-7 is a versioned model, not an alias. When Anthropic ships Opus 4.8, the existing pipeline does not silently start receiving 4.8 outputs. The constant has to be changed in code, the eval suite has to be re-run, and the deltas have to be reviewed. The shape of the upgrade is the same shape documented in the companion routing pillar at /blog/multi-model-routing: data, not engineering.
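One way to enforce that discipline is to fingerprint the pipeline configuration and refuse to run when the fingerprint differs from the one recorded with the last passing eval run. This is a sketch under that assumption: `config_fingerprint`, `require_evaluated_config`, and the recorded value are hypothetical, and the two-stage config is a cut-down copy of the one above.

```python
import hashlib
import json

PIPELINE_CONFIG = {
    "classify": {"model": "claude-opus-4-7", "temperature": 0.0, "seed": 42},
    "assess": {"model": "claude-opus-4-7", "temperature": 0.2, "seed": 42},
}


def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 over the config, independent of dict insertion order."""
    encoded = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def require_evaluated_config(config: dict, evaluated_fingerprint: str) -> None:
    """Raise if the config changed since the eval suite last passed against it."""
    if config_fingerprint(config) != evaluated_fingerprint:
        raise RuntimeError("pipeline config changed; re-run the regression set first")


# Fingerprint stored alongside the last passing eval report (computed here for the demo).
recorded = config_fingerprint(PIPELINE_CONFIG)
require_evaluated_config(PIPELINE_CONFIG, recorded)  # passes silently

# A hypothetical bump of the Assess model id changes the fingerprint.
bumped = {**PIPELINE_CONFIG, "assess": {**PIPELINE_CONFIG["assess"], "model": "claude-opus-4-8"}}
try:
    require_evaluated_config(bumped, recorded)
except RuntimeError as err:
    print(err)  # pipeline config changed; re-run the regression set first
```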

05 · MOVE 3 · EVAL-SET ANCHORING

The 200-trace regression set is the deterministic anchor.

PR-merge gate · per-stage Cohen's kappa floors · 99.5% citation precision floor · two model-version bumps caught

The pipeline is not allowed to ship without passing the regression set. Every prompt change, every model version bump, every code-path change runs the 200-trace regression set first. The set is stratified across the nine regulatory regimes Warrant maps to, six jurisdictions, four risk tiers, and three domain types. It was labelled by the founder and one external compliance advisor with prior buy-side and supervisory experience. Inter-annotator Cohen's kappa, human against human, sits at 0.84.

Stage | Floor κ (model vs human) | Current κ | Failure mode caught
01 · classify_trace | 0.85 | 0.86 | Risk-tier flip on edge-case lending traces
02 · extract_actions | 0.78 | 0.81 | Action-actor confusion on multi-actor traces
03 · assess_authorization | 0.80 | 0.83 | Reversibility flag on partial-execution actions
04 · map_obligations | 0.78 | 0.79 | Sub-clause precision drift on amended texts

Each stage has its own kappa floor. The Map stage is set lower because sub-clause precision is the harder problem and human-vs-human agreement on it is itself lower. Citation precision against canonical regulator text (EUR-Lex, dfs.ny.gov, federalreserve.gov, mas.gov.sg) has its own dedicated benchmark with a 99.5% floor. If any of these gates fail, the change is blocked at PR-merge. The full eval methodology is documented in the companion engineering pillar at /blog/regulator-grade-evals; this section concentrates on the part where the eval set acts as the deterministic anchor for the stochastic model behind it.
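For readers who have not computed it by hand, Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below is a plain stdlib implementation on made-up labels, not Warrant's eval code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Ten toy risk-tier labels; the raters disagree on the fourth item only.
human = ["high", "high", "limited", "limited", "minimal",
         "high", "limited", "minimal", "high", "limited"]
model = ["high", "high", "limited", "high", "minimal",
         "high", "limited", "minimal", "high", "limited"]

kappa = cohens_kappa(human, model)
print(round(kappa, 3))  # 0.844
print(kappa >= 0.85)    # False: one disagreement in ten misses a 0.85 floor
```

Here p_o is 0.9 and p_e is 0.36, so kappa is 0.54 / 0.64. Raw agreement of 90% is not enough to clear a 0.85 floor once chance agreement is removed.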

python · pytest gate at PR-merge
import pytest
from warrant.eval import run_regression_set, KAPPA_FLOORS, CITATION_FLOOR
from warrant.pipeline_config import PIPELINE_CONFIG  # assumed module path for pipeline_config.py

def test_regression_set_passes_at_merge_gate():
    results = run_regression_set(model_config=PIPELINE_CONFIG, seed=42)

    for stage, floor in KAPPA_FLOORS.items():
        actual = results[stage].cohens_kappa
        assert actual >= floor, (
            f"stage {stage}: kappa {actual:.3f} below floor {floor}; "
            f"merge blocked. compare per-trace deltas in eval-report.html"
        )

    citation = results["map"].citation_precision
    assert citation >= CITATION_FLOOR, (
        f"citation precision {citation:.4f} below floor {CITATION_FLOOR}; "
        f"merge blocked. canonical-text drift suspected."
    )

    # two model-version bumps in the past 90 days were blocked here:
    # sonnet-4-6-rc2 dropped citation precision to 98.9% on amended NYDFS text
    # opus-4-7-rc1 dropped Assess kappa to 0.78 on partial-execution traces

The eval-set anchor is what makes the rest of the system safe to upgrade. The model is non-deterministic in the small; the regression set measures whether the small variance has produced a meaningful change in the large. The kappa floors and the citation-precision floor are the contract. If a model upgrade can pass them, the upgrade ships. If it cannot, the upgrade is held until the prompts are retuned or the model id is rolled back. The decision is data, not engineering.
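The citation-precision side of that contract can be sketched as well, assuming precision means the share of emitted citations whose quoted text exactly matches the canonical text at the cited locator. That definition, the function name, and the locators are assumptions for illustration; the strings are placeholders, not regulator wording.

```python
def citation_precision(citations, canonical_text):
    """Share of emitted citations whose quote matches the canonical text for that locator."""
    if not citations:
        raise ValueError("no citations to score")
    correct = sum(
        1 for locator, quote in citations
        if canonical_text.get(locator) == quote
    )
    return correct / len(citations)


# Placeholder locators and text, not real regulator wording.
canonical = {"reg-x/1(a)": "text of clause 1(a)", "reg-x/2": "text of clause 2"}
emitted = [
    ("reg-x/1(a)", "text of clause 1(a)"),
    ("reg-x/2", "text of clause 2"),
    ("reg-x/2", "text of clause 2, as amended"),  # drifted quote, counts as a miss
    ("reg-x/9", "text of clause 9"),              # locator not in the canonical set
]
precision = citation_precision(emitted, canonical)
print(precision)           # 0.5
print(precision >= 0.995)  # False: below the 99.5% floor, so the change is blocked
```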

06 · MOVE 4 · RESIDUAL-UNCERTAINTY DISCLOSURE

Refusal is a first-class outcome.

Per-decision evidence record · four-class uncertainty enum · honesty over confidence

The regulator does not need a deterministic answer. The regulator needs an honest answer. On the stages where temperature is zero, the answer is reproducible by construction. On the stage where temperature is 0.2, the answer is calibrated within a published variance band, and that variance band is part of the evidence record.

Each per-decision record carries four core fields: the decoded answer itself, the model's emitted confidence, the top five alternative completions where temperature is greater than zero, and a four-class uncertainty enum. The enum is the regulator-facing summary of how much weight to put on the answer. The four classes are deliberately small.

python · uncertainty.py
from enum import Enum
from pydantic import BaseModel
from typing import List, Optional

class UncertaintyClass(str, Enum):
    CONFIDENT  = "confident"   # temperature 0, single decoded answer, p > 0.95
    CALIBRATED = "calibrated"  # temperature > 0, top-5 inside a tight band, p > 0.80
    UNCERTAIN  = "uncertain"   # answer present but top-5 spread > threshold
    REFUSED    = "refused"     # model declined to answer, reason recorded

class DecisionRecord(BaseModel):
    decision_id: str
    answer: Optional[str]
    emitted_confidence: Optional[float]
    top_alternatives: List[str] = []
    uncertainty: UncertaintyClass
    refusal_reason: Optional[str] = None
    stage: str
    seed: int
    temperature: float
    model_id: str

# the regulator-facing artefact carries every field. nothing is hidden.
# a REFUSED record flips the package status to attestation-incomplete.
# better than a confidently-wrong answer.

A REFUSED outcome is not an error. It is a decision the model made, and the package records it the same way it records any other decision. The regulator-facing summary surfaces the refusal explicitly: this action could not be assessed against the requested obligation because the model declined to render judgement, with the reason given as X. The package status flips to attestation-incomplete, the rest of the package remains a complete attested record, and the evidence chain remains intact for the actions that were assessable. An honest gap is more defensible than a confident invention.
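How a record lands in one of the four classes, and how a refusal propagates to the package, can be sketched as follows. The confidence thresholds (0.95 and 0.80) come from the enum comments above; the top-5 spread check is left out because its threshold is not stated, and the `attested` status label is a placeholder.

```python
from enum import Enum
from typing import List, Optional


class UncertaintyClass(str, Enum):
    CONFIDENT = "confident"
    CALIBRATED = "calibrated"
    UNCERTAIN = "uncertain"
    REFUSED = "refused"


def classify_uncertainty(answer: Optional[str], temperature: float,
                         confidence: Optional[float]) -> UncertaintyClass:
    """Map one decision to a class; a missing answer is treated as a refusal."""
    if answer is None or confidence is None:
        return UncertaintyClass.REFUSED
    if temperature == 0 and confidence > 0.95:
        return UncertaintyClass.CONFIDENT
    if temperature > 0 and confidence > 0.80:
        return UncertaintyClass.CALIBRATED
    return UncertaintyClass.UNCERTAIN


def package_status(classes: List[UncertaintyClass]) -> str:
    """Any refusal makes the package attestation-incomplete."""
    if UncertaintyClass.REFUSED in classes:
        return "attestation-incomplete"
    return "attested"  # placeholder label for the complete case


classes = [
    classify_uncertainty("limited", 0.0, 0.99),
    classify_uncertainty("in_scope", 0.2, 0.86),
    classify_uncertainty(None, 0.2, None),
]
print([c.value for c in classes])  # ['confident', 'calibrated', 'refused']
print(package_status(classes))     # attestation-incomplete
```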

The refusal-rate is itself an eval metric. A sudden spike in refusal rate triggers a regression run, because it usually indicates a prompt drift, a regulator-text amendment that has changed the framing, or a model-version safety retrain that has narrowed the model's willingness to engage with regulatory categorisation. None of those is fatal, but each requires investigation before the next attestation ships.
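A spike check of that kind can be as small as the sketch below. The two-percentage-point tolerance is a hypothetical value chosen for the example, not a published Warrant threshold.

```python
def refusal_rate(outcomes):
    """Share of decisions whose uncertainty class is 'refused'."""
    return sum(1 for o in outcomes if o == "refused") / len(outcomes)


def is_refusal_spike(baseline_rate, current_rate, max_increase=0.02):
    """Flag when the refusal rate rises more than max_increase above the baseline."""
    return (current_rate - baseline_rate) > max_increase


last_quarter = ["confident"] * 97 + ["refused"] * 3
this_week = ["confident"] * 90 + ["refused"] * 10
baseline, current = refusal_rate(last_quarter), refusal_rate(this_week)
print(baseline, current)                    # 0.03 0.1
print(is_refusal_spike(baseline, current))  # True: trigger a regression run
```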

07 · THE PROOF

Replay across 200 traces. 90 days. Three endpoints.

Trace-hash stability 100% · citation precision 99.7% → 99.6% within noise band · per-decision answer stability 100% on temperature-zero stages

The four moves above are the design. The proof is the production telemetry from the past 90 days, run across the 200-trace regression set on three model-API endpoints (Anthropic primary, Azure Foundry primary, Anthropic primary fallback). Three reproducibility metrics are tracked.

100% · trace-hash
CANONICAL HASH STABILITY
200 traces × 90 days × 3 endpoints. RFC 8785 JCS works. The canonical form is invariant across Python version, OS locale, and runtime.
99.7→99.6 · 90-day drift
CITATION PRECISION
Within noise band. The eval gate caught two model-version bumps that would have dropped to 98.9% and 99.1%; both blocked at PR-merge.
100% · temp 0 stages
ANSWER STABILITY
Byte-identical replay across the Classify, Extract, and Map stages. Five replays per trace, all stages, all endpoints.
96.8% · Assess
DECISION-CLASS STABILITY
Identical decision class on the temperature-0.2 stage across 5 replays per trace. Residual 3.2% variance is published per decision in the uncertainty band.

The first metric, trace-hash stability, is the cleanest result. JCS is doing exactly what RFC 8785 specifies. Two engineers, on different operating systems, in different time zones, with different Python versions installed, reduce the same trace to the same canonical form every time. The resulting evidence record is independently verifiable without contacting Warrant, regardless of who reconstructs it.

The second metric, citation precision, is where the eval-anchor pays off. The 99.7% baseline is measured against canonical regulator text. Across 90 days, the metric drifted to 99.6% within the regression-set noise band, well above the 99.5% floor. More importantly, the same gate caught two model-version bumps that would have dropped citation precision below the floor. A Sonnet 4.6 release-candidate would have dropped to 98.9% on amended NYDFS Part 500 text; an Opus 4.7 release-candidate would have dropped to 99.1% on a Federal Reserve SR 11-7 sub-clause. Both were rolled back to the prior pinned model id while the prompts were retuned. Neither was shipped to the production pipeline.

The third metric, per-decision answer stability, is where the deterministic-decoding move pays off. On the Classify, Extract, and Map stages, byte-identical replay holds at 100%. On the Assess stage, the decision class is identical in 96.8% of replays; the residual 3.2% variance is published per decision in the uncertainty band, which is exactly the calibrated behaviour the temperature-0.2 setting was chosen for. The variance is not noise. It is the model's calibrated honesty about borderline cases, surfaced explicitly in the artefact.
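The byte-identical figure can be computed per stage from stored replay outputs. The sketch below assumes replays are kept as raw bytes keyed by stage and trace id; the layout and sample payloads are illustrative.

```python
import hashlib


def is_byte_identical(outputs):
    """True when every replay of one trace produced the same bytes."""
    return len({hashlib.sha256(o).digest() for o in outputs}) == 1


def stability_by_stage(replays):
    """Per stage, the share of traces whose replays were all byte-identical."""
    return {
        stage: sum(is_byte_identical(outs) for outs in traces.values()) / len(traces)
        for stage, traces in replays.items()
    }


replays = {
    "classify": {
        "trace-001": [b'{"tier":"limited"}'] * 5,
        "trace-002": [b'{"tier":"high"}'] * 5,
    },
    "assess": {
        "trace-001": [b'{"class":"in_scope"}'] * 5,
        "trace-002": [b'{"class":"in_scope"}'] * 4 + [b'{"class":"out_of_scope"}'],
    },
}
print(stability_by_stage(replays))  # {'classify': 1.0, 'assess': 0.5}
```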

08 · NIST AI RMF · MEASURE FUNCTION

Where this hooks into RMF MEASURE 2.7.

TEVV processes for AI systems · reliability · accuracy · robustness · model uncertainty quantification

The NIST AI Risk Management Framework, in its MEASURE function, calls for test, evaluation, validation, and verification (TEVV) of AI systems covering, among other things, reliability, accuracy, and model uncertainty quantification. MEASURE 2.7 is the subcategory under which AI system security and resilience are evaluated and documented; validity and reliability sit alongside it under MEASURE 2.5. The four moves above are a TEVV instantiation against those subcategories.

  • Reliability · the trace-hash stability metric is the system-level reliability statement. The canonical artefact reproduces deterministically given the same logical inputs.
  • Accuracy · the citation precision benchmark and the per-stage Cohen's kappa floors are the system-level accuracy statement. Both are regression-tested and both are enforced at the merge gate.
  • Resilience · the dual-provider replay (Anthropic plus Azure Foundry) is the system-level resilience statement. A divergence between the two providers is itself a flag.
  • Model uncertainty quantification · the four-class uncertainty enum and the published variance band on the Assess stage are the model-uncertainty statement. The uncertainty is part of the artefact, not a footnote.

The deeper hook is documented in the companion regulator pillar at /blog/nist-ai-rmf. The relevant point here is that the engineering moves are not auxiliary; they are the TEVV process itself, expressed in code. Every NIST RMF audit question against MEASURE 2.7 has a one-table-row answer in the regression-set telemetry.

09 · ISO/IEC 42001 · A.6.2.4

Where this hooks into ISO 42001 V&V.

ISO/IEC 42001 Annex A.6.2.4 · verification and validation of AI systems · documented evidence

ISO/IEC 42001 is the AI management-system standard. Annex A.6.2.4 calls for "verification and validation" of AI systems, with documented evidence that the system performs to its stated specification under representative conditions. The 200-trace regression set, plus the four moves, plus the production telemetry, plus the published kappa floors, are the V&V evidence package against A.6.2.4. The companion regulator pillar at /blog/iso-iec-42001-ai-management-system walks the full annex.

The relevant property for this post is that ISO 42001 V&V evidence is auditable at the management-system level, not just the system-instance level. An auditor reviewing the AIMS asks: show me the V&V process, show me when it ran, show me what it caught, show me what was rolled back, show me how the rollback decision was made. The eval-set telemetry plus the merge-gate logs plus the model-version-pin history answer all of those. The V&V is documented because it is the merge gate; nothing about it is reconstructed after the fact.

10 · EU AI ACT · ARTICLE 12

Where this hooks into Article 12 logging.

EU AI Act, Regulation (EU) 2024/1689, Article 12 · automatic recording of events over the lifetime of the system

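Article 12 of the EU AI Act sets out what the logs of a high-risk AI system must make possible over its lifetime. Article 12(2) requires logging that supports identifying situations that may result in a risk or in a substantial modification, post-market monitoring, and monitoring of the system's operation. For remote biometric identification systems, Article 12(3) adds a minimum list: the period of each use, the reference database checked, the input data that led to a match, and the natural persons involved in verifying the results.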

"High-risk AI systems shall technically allow for the automatic recording of events (logs) over the lifetime of the system." EU AI Act · Article 12(1)

The Article 12 logging requirement is the part the four engineering moves directly enable. Without canonicalisation, two equivalent traces hash differently and the audit-replay test fails: the regulator cannot tie a logged event to the canonical record. Without deterministic decoding, the model's per-decision output is not reproducible and the evidence record drifts between runs. Without eval-anchoring, model-version drift silently changes the categorisation outcomes that Article 12 requires. Without residual-uncertainty disclosure, the categorisation is presented with false confidence and the regulator-facing record is misleading.

The four moves are how the Warrant pipeline is technically able to satisfy Article 12(1) over the lifetime of the system, not just at the moment of one attestation. The companion regulator pillar at /blog/eu-ai-act-article-12 walks the full article and the per-clause mapping.

11 · THE RESIDUAL GAP

Three things still imperfect today. The V0.5 work.

Backend non-determinism · prompt-cache state · refusal-rate drift · independent deterministic replay in V0.5

The four moves close most of the gap. They do not close all of it. Three sources of residual variance remain, and each has an explicit mitigation that is shipping or in flight.

Backend non-determinism beyond Warrant's control

GPU kernel non-determinism in the model provider's stack, plus replica-routing variance at the load balancer, plus floating-point reduction order across hardware revisions. Mitigation: run the same trace through two providers (Anthropic primary plus Azure Foundry primary) and flag any divergence. Two divergences in 90 days, both on a single token, both inside the uncertainty band, both surfaced rather than suppressed.

Prompt-cache state

Anthropic's prompt-caching feature can change response timing and, in rare race conditions, response content. Mitigation: log cache_hit and cache_creation_input_tokens in trace metadata. A replay that diverges with identical cache state is a real divergence; a replay that diverges with different cache state has a known explanation and is investigated.
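That triage rule is simple enough to state in code. The sketch assumes each replay's log entry carries an output hash and a boolean cache_hit, as in the structured log line shown earlier; the verdict labels are hypothetical.

```python
def triage_replay(first: dict, second: dict) -> str:
    """Classify a pair of replays using their output hashes and logged cache state."""
    if first["output_sha256"] == second["output_sha256"]:
        return "stable"
    if first["cache_hit"] == second["cache_hit"]:
        return "real-divergence"       # same cache state, different bytes
    return "cache-state-differs"       # known explanation, still investigated


run_a = {"output_sha256": "aa11", "cache_hit": True}
run_b = {"output_sha256": "bb22", "cache_hit": True}
run_c = {"output_sha256": "bb22", "cache_hit": False}

print(triage_replay(run_a, run_a))  # stable
print(triage_replay(run_a, run_b))  # real-divergence
print(triage_replay(run_a, run_c))  # cache-state-differs
```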

Refusal-rate drift

As the model trains on safety, refusal patterns change over time. A prompt that was assessable last quarter may produce a REFUSED outcome this quarter. Mitigation: refusal-rate is itself an eval metric tracked over time. A spike triggers a regression run and a prompt review.

The V0.5 work makes the replay itself independently runnable without contacting Warrant: a regulator re-runs the attested trace through a frozen model snapshot taken at the time of original attestation and confirms byte-equivalence on every temperature-zero stage. The frozen snapshot is the model id pinned to a specific Anthropic model version, with the seed fixed and the prompt cache disabled. Given the package_id and the pinned model snapshot, the replay has two possible outcomes: byte-equivalent (replay PASS) or divergent (replay FAIL, with the divergent stage and offset surfaced). The work is in the V0.5 milestone because the prerequisite, a stable Anthropic model-snapshot pinning interface, is shipping in the same window.
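The PASS/FAIL shape of that replay can be sketched without any model in the loop, assuming the attested and replayed outputs are available as bytes per stage. The function names and the tuple return shape are illustrative, not the V0.5 interface.

```python
from typing import Dict, Optional, Tuple


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Byte offset of the first difference, or None when the inputs are identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))  # one input is a prefix of the other
    return None


def replay_verdict(attested: Dict[str, bytes],
                   replayed: Dict[str, bytes]) -> Tuple[str, Optional[str], Optional[int]]:
    """Return ('PASS', None, None) or ('FAIL', stage, offset) for the first divergent stage."""
    for stage in attested:
        offset = first_divergence(attested[stage], replayed.get(stage, b""))
        if offset is not None:
            return ("FAIL", stage, offset)
    return ("PASS", None, None)


attested = {"classify": b'{"tier":"limited"}', "extract": b'{"actions":["approve"]}'}
replayed = {"classify": b'{"tier":"high"}', "extract": b'{"actions":["approve"]}'}

print(replay_verdict(attested, attested))  # ('PASS', None, None)
print(replay_verdict(attested, replayed))  # ('FAIL', 'classify', 9)
```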

Until that capability ships, the audit-replay test is run internally on a nightly cadence against the production attestation history, and the results are published in the trust-page reproducibility table. The gap is narrow; it is not zero; it is documented.

12 · THE SHAPE OF THE CLOSURE

The model is probabilistic. The artefact is not.

The four moves do not make an LLM deterministic · they make the evidence deterministic given the LLM's outputs

The closure is worth stating cleanly. The four moves do not make an LLM deterministic. They make the evidence deterministic given the LLM's outputs. The model can still be probabilistic. The artefact the regulator reads is reproducible.

Two engineers with the same trace and the same Warrant package_id can independently confirm the same canonical record, the same regulator citations, the same per-decision uncertainty class, without contacting Warrant. They cannot necessarily verify the same per-token attention activations, and they do not need to. The chain of trust is not "I trust the kernel". The chain of trust is "I can reduce the trace to its canonical form, confirm the record is the one that was attested, and re-run the eval suite to check that the system that produced the artefact still passes its contract". Each step is independently runnable. None of the steps depends on a non-reproducible execution.

That is the engineering closure between a stochastic model and a deterministic audit. The model is one thing the regulator cannot run; the evidence record is something the regulator can. The four moves are how those two truths coexist. Trace canonicalisation under RFC 8785 is the byte-form discipline. Deterministic decoding is the model-boundary discipline. Eval-set anchoring is the system-level discipline. Per-decision residual-uncertainty disclosure is the artefact-level discipline. Each one closes a gap the others cannot. Together they produce a record the regulator can replay and verify, on a Tuesday, on a different machine, in a different time zone, three years after the attestation was issued.

"The model can still be probabilistic. The artefact the regulator reads is reproducible." The closure · why stochastic evidence is a solvable problem

13 · FAQ

Questions an architect asks first.

FAQ · sourced from inbound questions from platform and ML engineering teams during the May 2026 launch week
Why is temperature 0 not enough on its own?

Temperature 0 makes the next-token sampler greedy, but it does not make the underlying execution deterministic. GPU kernel non-determinism in attention, provider-side load balancing across model replicas, prompt-cache hit-or-miss state, and floating-point reduction order across devices all leak variance into the response. Two API calls with temperature 0 to the same model can return different bytes. Temperature 0 closes one source of variance; the canonicalisation, eval-anchor, and residual-uncertainty moves close the rest.
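The reduction-order point can be seen without a GPU. The sketch below is a toy illustration, not a model of any provider's stack: the three partial terms and the rival logit are made-up values chosen so that the effect is visible. It adds the same floats in two orders and then applies a greedy pick to each result.

```python
def reduce_in_order(terms):
    """Add floats strictly left to right, with no compensation."""
    total = 0.0
    for term in terms:
        total += term
    return total


def greedy_pick(logits):
    """Greedy decoding: index of the largest logit (first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical partial sums feeding one token's logit, and a rival token.
partials = [0.1, 0.2, 0.3]
rival_logit = 0.6

forward = reduce_in_order(partials)             # (0.1 + 0.2) + 0.3
backward = reduce_in_order(reversed(partials))  # (0.3 + 0.2) + 0.1

pick_forward = greedy_pick([rival_logit, forward])
pick_backward = greedy_pick([rival_logit, backward])

print(forward == backward)          # False: same terms, different bits
print(pick_forward, pick_backward)  # the greedy choice flips
```

The sampler is greedy in both runs. Only the order of addition changed, and that is enough to change which token wins a near-tie.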

What is RFC 8785 and why does it matter?

RFC 8785 is the JSON Canonicalization Scheme (JCS). It produces a unique deterministic byte serialisation of equivalent JSON values, regardless of Python version, OS locale, dict-insertion order, or float-formatting rules. It fixes three things: numbers are serialised in the ECMAScript form of their IEEE 754 double value, strings use one fixed set of escaping rules, and object keys are sorted by UTF-16 code unit at every nesting depth. It does not apply Unicode normalisation: string data is preserved as-is, so any NFC normalisation has to happen before canonicalisation. Two JSON documents that carry the same values reduce to the same canonical bytes under JCS. Two documents that differ in any meaningful way reduce to different bytes. The whole evidence record relies on JCS being the canonical form, which is what makes that record independently verifiable without contacting Warrant.
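A minimal sketch of the idea, using only the Python standard library. This is not a full RFC 8785 implementation: it assumes ASCII object keys and integers within the IEEE 754 safe range, and it rejects floats outright, because the number serialisation JCS requires is the part that json.dumps does not reproduce. The record fields (stage, decision, tokens) are hypothetical.

```python
import hashlib
import json

MAX_SAFE_INT = 2**53 - 1  # integers beyond this are not exact as doubles


def _check(value):
    """Reject anything outside the restricted subset this sketch handles."""
    if value is None or isinstance(value, (bool, str)):
        return
    if isinstance(value, int):
        if abs(value) > MAX_SAFE_INT:
            raise ValueError("integer outside the IEEE 754 safe range")
        return
    if isinstance(value, list):
        for item in value:
            _check(item)
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str) or not key.isascii():
                raise ValueError("sketch only supports ASCII string keys")
            _check(item)
        return
    raise TypeError(f"unsupported type in sketch: {type(value).__name__}")


def canonical_bytes(value):
    """Sorted keys, no insignificant whitespace, UTF-8 output."""
    _check(value)
    text = json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def record_hash(value):
    return hashlib.sha256(canonical_bytes(value)).hexdigest()


# Same value, different insertion order: same bytes, same hash.
a = {"stage": "Classify", "decision": "PASS", "tokens": [1, 2, 3]}
b = {"tokens": [1, 2, 3], "decision": "PASS", "stage": "Classify"}
changed = {**a, "decision": "FAIL"}

print(canonical_bytes(a))
print(record_hash(a) == record_hash(b), record_hash(a) == record_hash(changed))
```

For production use, a conformant JCS library is the safer choice; the sketch only shows why the hash becomes a function of the value rather than of the encoding.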

Can two replays of the same trace produce different evidence?

On the temperature-0 stages (Classify, Extract, Map): no. Across 200 traces over 90 days, byte-identical replay holds at 100%. On the temperature-0.2 stage (Assess), the decision class is identical in 96.8% of replays; the residual 3.2% variance is published per decision in the uncertainty band. The artefact the regulator reads is reproducible to byte equivalence on the stages that matter and to declared-uncertainty equivalence on the judgement stage.

How does the eval suite catch model-version-bump regressions?

The 200-trace regression set runs at the PR-merge gate and on every change to the model-version constant. Per-stage Cohen's kappa floors are enforced (Classify 0.85, Extract 0.78, Assess 0.80, Map 0.78) alongside a 99.5% citation-precision floor. Two model-version bumps in the past 90 days were blocked at the gate: one would have dropped citation precision to 98.9% on a Sonnet revision, the other to 99.1% on an Opus revision. Both were rolled back to the prior pinned model id while the prompts were retuned.
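A sketch of how such a gate could be expressed. The floors are the ones quoted above; the function names, the shape of the inputs, and the assumption that kappa is computed between pipeline labels and the labelled reference set are illustrative, not a description of the actual gate.

```python
from collections import Counter

KAPPA_FLOORS = {"Classify": 0.85, "Extract": 0.78, "Assess": 0.80, "Map": 0.78}
CITATION_PRECISION_FLOOR = 0.995


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


def gate_failures(stage_kappas, citation_precision):
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = [
        f"{stage}: kappa {stage_kappas[stage]:.3f} below floor {floor}"
        for stage, floor in KAPPA_FLOORS.items()
        if stage_kappas[stage] < floor
    ]
    if citation_precision < CITATION_PRECISION_FLOOR:
        failures.append(
            f"citation precision {citation_precision:.3f} "
            f"below floor {CITATION_PRECISION_FLOOR}"
        )
    return failures


healthy = {"Classify": 0.90, "Extract": 0.82, "Assess": 0.84, "Map": 0.81}
print(gate_failures(healthy, 0.997))  # passes
print(gate_failures(healthy, 0.989))  # blocked on citation precision
```

Returning the full list of failures, rather than stopping at the first, keeps the gate output useful when a model bump degrades more than one stage at once.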

What happens when the model refuses to answer?

Refusal is a first-class outcome, not an error. The per-decision evidence record carries a four-class uncertainty enum: CONFIDENT, CALIBRATED, UNCERTAIN, REFUSED. A refusal flips the package status to attestation-incomplete and surfaces the reason. That is better than a confidently wrong answer: the regulator gets an honest statement of what the model could and could not assess. Refusal rate is itself an eval metric tracked over time, so a sudden refusal spike triggers a regression run.
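One way to model this, as a sketch. The four enum members and the attestation-incomplete status come from the text above; the Decision record, its field names, and the attestation-complete status string are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class UncertaintyClass(Enum):
    CONFIDENT = "CONFIDENT"
    CALIBRATED = "CALIBRATED"
    UNCERTAIN = "UNCERTAIN"
    REFUSED = "REFUSED"


@dataclass(frozen=True)
class Decision:
    decision_id: str
    uncertainty: UncertaintyClass
    reason: str = ""  # populated when the model refuses


def package_status(decisions):
    """Any refusal makes the package incomplete and surfaces the reasons."""
    refused = [d for d in decisions if d.uncertainty is UncertaintyClass.REFUSED]
    if refused:
        return "attestation-incomplete", [(d.decision_id, d.reason) for d in refused]
    return "attestation-complete", []


def refusal_rate(decisions):
    """Share of decisions that were refusals; tracked as an eval metric."""
    if not decisions:
        return 0.0
    refused = sum(d.uncertainty is UncertaintyClass.REFUSED for d in decisions)
    return refused / len(decisions)


decisions = [
    Decision("d1", UncertaintyClass.CONFIDENT),
    Decision("d2", UncertaintyClass.CALIBRATED),
    Decision("d3", UncertaintyClass.REFUSED, "source document unreadable"),
    Decision("d4", UncertaintyClass.UNCERTAIN),
]
print(package_status(decisions))
print(refusal_rate(decisions))
```

Keeping REFUSED inside the same enum as the confidence classes means a refusal travels through the record like any other outcome, instead of being raised as an exception and lost.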

Is byte-identical replay actually achievable in practice?

On the trace-hash and the canonical evidence record: yes, deterministically, by construction. JCS makes the hash a function of the JSON value, not of the encoding. On the LLM stage outputs at temperature 0: empirically yes, 100% across our 200-trace 90-day window, but with the caveat that the model provider's stack carries residual non-determinism from kernel behaviour and load balancing. The mitigation is dual-provider replay: the same trace runs through Anthropic as the primary provider and Azure Foundry as the second, and any divergence between the two is flagged as an integrity event.

How does this connect to Anthropic's prompt-caching feature?

Prompt caching is a latency and cost optimisation; under most conditions it does not change response content. It can change response timing and, in rare race conditions, can interact with cache eviction in ways that surface different tokens. The mitigation is to log the cache_read_input_tokens and cache_creation_input_tokens fields from the response's usage block in trace metadata; a non-zero cache_read_input_tokens indicates a cache hit. If a replay produces a different byte output and the cache-hit state differs, the integrity check has a known explanation. If a replay diverges with identical cache state, that is a real divergence event.
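The triage rule in that answer can be sketched as a small classifier. It assumes each stage run was logged with its output bytes and the cache-read and cache-creation token counts from the response usage, and it treats a non-zero cache-read count as a cache hit. The StageRun record and the three verdict strings are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRun:
    output: bytes
    cache_read_input_tokens: int      # > 0 means the prompt cache was hit
    cache_creation_input_tokens: int  # > 0 means a cache entry was written

    @property
    def cache_hit(self):
        return self.cache_read_input_tokens > 0


def classify_replay(original, replay):
    """Compare a replay against the original run and explain any divergence."""
    if original.output == replay.output:
        return "MATCH"
    if original.cache_hit != replay.cache_hit:
        # Bytes differ, but so does cache state: a known explanation exists.
        return "DIVERGED_CACHE_STATE_DIFFERS"
    # Bytes differ with identical cache state: a real divergence event.
    return "DIVERGED_INTEGRITY_EVENT"


first = StageRun(b'{"class":"A"}', 0, 1200)
same = StageRun(b'{"class":"A"}', 1200, 0)
cache_flip = StageRun(b'{"class":"B"}', 1200, 0)
real_drift = StageRun(b'{"class":"B"}', 0, 1200)

print(classify_replay(first, same))
print(classify_replay(first, cache_flip))
print(classify_replay(first, real_drift))
```

A divergence with a differing cache state is still a divergence; the classification only decides whether it has a known explanation or needs to be escalated.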

What does the V0.5 deterministic-replay work give a regulator?

The V0.5 work makes an attested record independently verifiable without contacting Warrant. A regulator re-runs an attested trace through a frozen model snapshot, taken at the time of original attestation, and confirms byte-equivalence on every temperature-0 stage. The frozen snapshot is the model id pinned to a specific Anthropic model version, with the seed fixed and the prompt-cache disabled. Given the package_id and the pinned model snapshot, there are two outcomes: byte-equivalent (replay PASS) or divergent (replay FAIL, with the divergent stage and offset surfaced).
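The PASS/FAIL comparison itself is simple enough to sketch. This is an illustration of the check described above, not the V0.5 implementation: the function names, the dict-of-bytes input shape, and the result fields are assumptions. The stage list is the set of temperature-0 stages named earlier.

```python
TEMPERATURE_ZERO_STAGES = ("Classify", "Extract", "Map")


def first_divergence(attested, replayed):
    """Byte offset of the first difference, or None if the inputs are equal."""
    for offset, (x, y) in enumerate(zip(attested, replayed)):
        if x != y:
            return offset
    if len(attested) != len(replayed):
        # One output is a strict prefix of the other.
        return min(len(attested), len(replayed))
    return None


def replay_verdict(attested, replayed, stages=TEMPERATURE_ZERO_STAGES):
    """Compare per-stage output bytes; report the first divergent stage."""
    for stage in stages:
        offset = first_divergence(attested[stage], replayed[stage])
        if offset is not None:
            return {"result": "FAIL", "stage": stage, "offset": offset}
    return {"result": "PASS", "stage": None, "offset": None}


attested = {"Classify": b"class=A", "Extract": b"clause 4.2", "Map": b"art. 17"}
identical = dict(attested)
drifted = dict(attested, Extract=b"clause 4.3")

print(replay_verdict(attested, identical))
print(replay_verdict(attested, drifted))
```

Surfacing the stage and byte offset turns a failed replay into something an engineer can open and inspect, rather than a bare mismatch between two hashes.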

14 · READ THE SOURCE

Read the source directly.

SEE THE PIPELINE RUN

Drop a trace. Replay it. Read the same canonical record twice.

The fastest way to read the engineering closure is to run a sample trace through the live demo, download the package, re-run it, and confirm the canonical record matches the first run without contacting Warrant. Two runs, one canonical form, one byte-identical evidence record on the temperature-zero stages. The pipeline you have just read about, in production, on a real trace.