NIST AI 100-2 adversarial ML taxonomy for agents

01 · WHAT NIST AI 100-2 IS

A published taxonomy. Not a regulation.

This report develops a taxonomy of concepts and defines terminology in the field of adversarial machine learning. The taxonomy is arranged in a conceptual hierarchy that includes key types of ML methods, life cycle stages of attack, and attacker goals, and identifies methods for mitigating and managing the consequences of those attacks. NIST AI 100-2e2025 · Abstract · 24 March 2025

One paragraph from the abstract carries the load. The document is a taxonomy and a set of terminology. It is descriptive, not prescriptive. It does not say thou shalt. It says if your AI system is attacked, this is the vocabulary auditors and engineers should be using when you describe what happened and what you did about it.

The lineage matters. NIST AI 100-2 first appeared in 2019 as a draft. The 2023 edition gave it a stable taxonomy for predictive machine learning. The 2024 update extended the taxonomy to generative AI under the same four-class structure. The 2025 final, formally NIST.AI.100-2e2025, consolidated both into a single document.

What it is not. It is not an EU harmonised standard. It is not a NIST-issued conformity scheme. It is not, in itself, a defence in any litigation. What it is, in operational terms, is the closest thing engineering teams currently have to a shared dictionary for AI-specific attacks. That makes it the lowest-friction way to translate Article 15(5) of the EU AI Act into a per-decision evidence pattern.

"A taxonomy is not a regulation. A taxonomy is what makes a regulation operational."Warrant Engineering · 2026-05-11

The taxonomy is organised across three axes. The ML method axis distinguishes predictive AI from generative AI. The life-cycle axis separates training-time attacks from inference-time attacks. The attacker-goal axis names what the attacker is trying to achieve, whether that is integrity violation, availability violation, or privacy violation. The four top-level attack classes sit at the intersection of those axes.

02 · EVASION

Evasion · adversarial inputs at inference time.

An evasion attack is an inference-time attack on a predictive AI system. The model is already trained. The training pipeline is untouched. The attacker modifies the input so that the model's output is wrong in a way that benefits the attacker.

The canonical example is the adversarial image. A photograph of a stop sign with a precisely calibrated perturbation invisible to humans, classified by a vision model as a speed-limit sign. The mathematics generalises. Tabular features in a credit-scoring model. Tokenised text in a sentiment classifier. Network packets in an intrusion-detection system.

NIST AI 100-2 names the attacker's capability on a three-step ladder. White-box assumes the attacker has the model architecture and weights. Grey-box assumes partial knowledge, often the architecture but not the weights, or a known training corpus. Black-box assumes only query access through the production interface.

white

Full knowledge of the target model. Gradient-based attacks work directly. Strongest threat model, often used as a benchmark for defensive evaluations. EXAMPLE · projected gradient descent (PGD) crafting against a known weights file.

grey

Partial knowledge. Architecture without weights. Training distribution without parameters. EXAMPLE · transfer attacks crafted on a surrogate model trained on the same dataset.

black

Query-only access via the production API. Decision boundaries are inferred by probing. EXAMPLE · ZOO, square-attack, or model-extraction precursors to a transfer attack.

The engineering implication for the evidence record is direct. For any predictive decision the agent took, the trace must record the input that produced the decision and a typed indicator of whether that input passed an adversarial-input check. The check itself is layered. Statistical detection on input distribution. Distance from training-set neighbours. Optionally, a model-specific certified-bound check.

What the trace must not claim is immunity. NIST AI 100-2 is explicit that no current defence eliminates evasion. The honest signal is detection coverage with a known false-negative rate, not a binary passed.

03 · POISONING

Poisoning · training-time attacks.

Poisoning is the training-time counterpart to evasion. The attacker has access to the training data, or to some part of it, or to the pipeline that ingests it. The poisoning is in the corpus, not in the request.

NIST AI 100-2 separates poisoning by attacker goal. Availability poisoning degrades the model's accuracy generally. Integrity poisoning causes incorrect outputs for specific targeted inputs while leaving general accuracy intact. Backdoor attacks install a hidden trigger pattern such that any input carrying the trigger is misclassified to an attacker-chosen label.

Two capability levels. Full training-set access assumes the attacker controls or substantially modifies the corpus. Partial access, more realistic in 2026 supply chains, assumes the attacker contaminates a subset, perhaps a few percent of an open dataset, perhaps a single internet source that gets scraped.

AVAILABILITY

General accuracy degraded. Model becomes unreliable across many inputs. Attacker wants the system removed from service.

INTEGRITY

Specific targeted inputs misclassified. General test accuracy unaffected. Attacker wants a particular decision to fall a particular way.

BACKDOOR

Hidden trigger pattern installed. Any input carrying the trigger flips to an attacker-chosen label. Discoverable by trigger-search only.

SUPPLY-CHAIN

Poisoning enters through a foundation model, a third-party dataset, or a fine-tune from an unknown source. The provenance gap is the vulnerability.

The evidence pattern for poisoning is upstream of the per-decision trace. It lives in metadata about the model, not the request. Training-data provenance, dataset hashes, source attestation for fine-tune corpora, the integrity of any retrieval-augmented index. NIST AI 100-2 does not prescribe the artefacts. It names the class so that auditors can ask the right question.

For a 2026 generative-AI deployment, the operational reality is that almost no provider can prove the absence of poisoning in a foundation model. The defensible posture is documented provenance for everything inside the deployer's control, and a contractual chain of attestations for everything outside it. That is the cybersecurity posture Article 15(5) asks for, read alongside the technical documentation under Annex IV.

04 · PRIVACY

Privacy · extracting from the model.

The third class is privacy attacks. The attacker is not trying to misclassify an input or corrupt the training pipeline. The attacker is trying to extract information about the training data, the model parameters, or the individuals whose data was used to train.

NIST AI 100-2 names four sub-types. Membership inference determines whether a specific record was in the training set. Attribute inference recovers sensitive attributes of training records. Model inversion reconstructs representative training examples from model outputs. Training-data extraction, the strongest, recovers literal records from generative models that have memorised them.

mem

Membership inference. Yes-or-no on whether a record was in the training set. RISK · GDPR Article 4(1) personal data leakage. EU AI Act Article 15(5) confidentiality.

attr

Attribute inference. Reconstruct one or more sensitive attributes given a partial record. RISK · sensitive special-category data under GDPR Article 9 may be inferred from non-special-category features.

inv

Model inversion. Reconstruct representative training examples by inverting the prediction function. RISK · facial-recognition systems and clinical models are particularly vulnerable.

ext

Training-data extraction. Recover literal training records, often via repeated structured prompting of large generative models. RISK · most acute for foundation models trained on web-scale corpora that include PII.

Capability ranges from API-only access through full model-weights access. Differential privacy is the only mitigation in the taxonomy that offers a formal mathematical guarantee, at a measurable cost in utility. Everything else is empirical hardening: query rate-limiting, output filtering, post-hoc memorisation audits.

For the per-decision record, the evidence field is whether the action interacted with a privacy-sensitive surface, and if so, which differential-privacy or output-filtering control was active. The artefact does not claim the model is private. It records what privacy posture was in force at the time of the decision.

05 · ABUSE

Abuse · the class the 2024 revision added.

The fourth class is the one that justifies the 2024 update and the 2025 final. The original 2023 taxonomy did not have an Abuse category. With generative AI in widespread production, the attacker stopped trying to break the model and started trying to instruct it.

Abuse covers adversarial use of generative AI systems. The model is functioning correctly in the predictive-AI sense. The vulnerability is that functioning correctly for a chat or agent system means doing what the input asked. When the input is hostile, the model executes the hostile instruction. The threat surface is the prompt.

Adversarial abuse of generative AI systems encompasses techniques used to misuse or bypass the intended behaviour of a generative AI system, including direct prompt injection, indirect prompt injection through retrieved or tool-returned content, and jailbreak attacks intended to override safety training. [verification pending exact wording] Paraphrase of NIST AI 100-2e2025 abuse section

Three named patterns sit inside Abuse. Direct prompt injection is an instruction in the user prompt that overrides the system prompt. Indirect prompt injection is the same instruction delivered through content the model reads from another source, a retrieved document, a tool's return value, an email body, a webpage. Jailbreak is a stylised prompt that circumvents the model's safety training to produce content the operator did not authorise.

The capability story differs from predictive AI. Black-box is the default. The attacker rarely needs weights. They need the prompt surface, plus, increasingly, any path through which untrusted content reaches the model. For a retrieval-augmented agent, every retrieval source is a prompt-injection vector. For a tool-using agent, every tool's return value is.

The mitigation story is empirical and unsettled. System-prompt hardening, instruction-tuning for resistance, content classifiers on inputs, content classifiers on outputs, structural separation between trusted system instructions and untrusted user or retrieved content. NIST AI 100-2 enumerates the techniques without ranking them. The 2026 honest engineering answer is layered defence with structured per-decision evidence of which defences were in force.

06 · MITIGATIONS

Mitigation taxonomy · what defences map to what attacks.

NIST AI 100-2 pairs the attack taxonomy with a mitigation taxonomy. Three observations matter for engineering teams.

First. Mitigations have capability and knowledge requirements of their own. Adversarial training requires retraining the model on adversarially perturbed examples. Certified defences require model architectures amenable to formal bounds and impose accuracy costs. Differential privacy requires bounded privacy budgets and reduces utility. Input sanitisation requires distributional knowledge of legitimate inputs. Output filtering requires a classifier downstream of the model.

Second. No mitigation generalises across all four attack classes. A defence that hardens evasion may have no effect on poisoning. A defence that mitigates membership inference may be irrelevant to prompt injection. The taxonomy is explicit on this point.

M·1

Adversarial training. Augment the training set with adversarially perturbed examples. TARGETS · evasion. COST · clean accuracy reduction, training-cost multiplier, no transfer to poisoning or abuse.

M·2

Certified defences. Provable robustness bounds for bounded perturbations. TARGETS · evasion in narrow regimes. COST · model-class restrictions, scalability constraints.

M·3

Differential privacy. Bounded influence of any single training record on the trained model. TARGETS · membership inference, attribute inference, training-data extraction. COST · utility reduction proportional to privacy budget.

M·4

Input sanitisation and anomaly detection. Statistical filters on inputs. TARGETS · evasion at inference, some indirect prompt-injection variants. COST · false positives on legitimate edge cases.

M·5

Output filtering and content classifiers. Classifiers downstream of the model. TARGETS · abuse, particularly jailbreak. COST · false-negative tail by construction.

M·6

Provenance and watermarking. Cryptographic or statistical marking of model outputs and training corpora. TARGETS · downstream attribution, not direct attack prevention. COST · detection-only.

Third, and most useful for an attestation discipline. The mitigation taxonomy is the natural home for per-decision evidence. The audit question is rarely do you defend against evasion. The audit question is which evasion defence was in force when this decision was taken on this person on this date, and what did it produce. That maps to a structured field in a trace, not a marketing claim in a product page.

07 · EU AI ACT BRIDGE

Article 15(5) · the regulator's sentence.

High-risk AI systems shall be resilient against attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities. The technical solutions aimed at ensuring the cybersecurity of high-risk AI systems shall be appropriate to the relevant circumstances and the risks. Regulation (EU) 2024/1689 · Article 15(5) · 13 June 2024

One sentence. Two operative phrases. The first, attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities, is the threat surface. The second, appropriate to the relevant circumstances and the risks, is the proportionality test the regulator will apply when the provider's defence is challenged.

The Regulation does not enumerate the system vulnerabilities it has in mind. The standards body cannot enumerate them either, because the literature evolves quarterly. So the Regulation leaves the question open and points to harmonised standards under Articles 40 and 41 to fill in the technical content over time. Until those harmonised standards are published, providers and notified bodies fall back on the recognised state of the art.

NIST AI 100-2 is one widely recognised statement of that state of the art for AI-specific threats. It is not cited by name in the Regulation. It is not declared by the Commission to confer presumption of conformity. It is, in 2026, the most defensible single document an engineering team can point to when an auditor asks what system vulnerabilities they have in fact considered.

The other side of the bridge is the recital. Recital 76 of the Regulation specifies that cybersecurity for high-risk AI systems includes data poisoning, adversarial examples, model evasion, confidentiality attacks and model flaws. That list sits inside the NIST four-class taxonomy almost without translation. For the wider Article 15 obligation (accuracy, robustness, and cybersecurity together), see the Article 15 reading filed alongside. For the application-layer companion taxonomy, see the OWASP LLM Top 10.

08 · FIELD MAPPING

Where Warrant maps NIST AI 100-2 into the trace.

Warrant turns the four-class taxonomy into four typed evidence fields per action. The mapping is mechanical and the field names are stable across deployments.

EVA

Evasion · adversarial input at inference time. FIELD · trace.actions[*].adversarial_input_check (detector_id, score, threshold, decision, false_negative_rate_known).

POI

Poisoning · training-set integrity. FIELD · metadata.training_data_provenance (dataset_hashes, fine_tune_source_attestation, foundation_model_provider_attestation).

PRV

Privacy · membership inference, attribute inference, model inversion, training-data extraction. FIELD · trace.actions[*].privacy_attack_surface (dp_budget_active, output_filter_active, rate_limit_active, sensitive_attribute_present).

ABU

Abuse · direct prompt injection, indirect prompt injection, jailbreak. FIELD · trace.actions[*].abuse_pattern_check (injection_score, indirect_source_trust, jailbreak_classifier_decision, safety_classifier_decision).

The fields are not assertions of immunity. They are records of what was checked, with what detector, against what threshold, with what result. An auditor reading the trace can verify that the four NIST classes were considered for the decision in front of them, and can read the false-negative posture out of the record itself.

Sample evidence package · NIST four-class fields populatedINDEPENDENTLY VERIFIABLE WITHOUT CONTACTING WARRANT

→ /v/7de85ceaeac42a47

09 · OWASP OVERLAP

Cross-reference · NIST taxonomy versus OWASP LLM Top 10.

The OWASP LLM Top 10 is the parallel artefact most engineering teams know by name. The two documents serve different functions. NIST AI 100-2 is the taxonomy. OWASP LLM is the prioritised practitioner list.

The overlap is partial and non-controversial. OWASP LLM01 prompt injection sits squarely inside NIST Abuse. OWASP LLM02 insecure output handling intersects NIST Abuse and, where outputs feed downstream classifiers, NIST Evasion. OWASP LLM03 training-data poisoning is a one-to-one with NIST Poisoning. OWASP LLM06 sensitive information disclosure maps to NIST Privacy.

What OWASP adds that NIST does not, and vice versa, is a useful filter. OWASP is more applied: each item is a category of finding an engineer can fix in code or configuration this quarter. NIST is more structural: each class is a frame an auditor can apply across an entire system. The compliant 2026 posture cites both.

The Article 15(5) auditor, on a current reading, will accept either as evidence the team considered AI-specific threats systematically. A team that cites neither will be asked which published reference they did consider. The answer cannot be silence.

10 · FAQ

Questions a security officer asks first.

How does NIST AI 100-2 relate to EU AI Act Article 15(5)?

Article 15(5) requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter use, outputs, or performance through exploiting system vulnerabilities. NIST AI 100-2e2025 is one widely recognised published taxonomy that translates that abstract phrase into operational categories: evasion, poisoning, privacy, abuse. It is not cited by name in the Regulation, but it is the engineering-quality reference U.S. and many international auditors apply when assessing AI-specific cybersecurity.

Does NIST AI 100-2 conformity satisfy EU AI Act cybersecurity?

No single publication is declared by the European Commission to confer presumption of conformity for Article 15(5). The harmonised standards process under Articles 40 and 41 is still in progress. Aligning to NIST AI 100-2e2025 is strong evidence of state-of-the-art engineering practice. It is not, on its own, a legal safe harbour.

What is the difference between Evasion and Abuse?

Evasion is an attack at inference time on a predictive ML system. The attacker crafts an input that causes a misclassification or a wrong decision while the model behaves as designed. Abuse is the adversarial use of a generative AI system, primarily through the prompt surface. The model is not misclassifying. The model is doing what it was asked, where the asking was hostile. Prompt injection and jailbreak sit inside Abuse.

How is the Abuse class different in the 2024-2025 revision?

The original 2019 and 2023 editions of NIST AI 100-2 focused on predictive ML. The 2024 update and the 2025 final (NIST.AI.100-2e2025) extended the taxonomy to generative AI and large language models, including a dedicated treatment of direct prompt injection, indirect prompt injection through retrieved or tool-returned content, and jailbreak techniques. Abuse is the structural home for those threats.

What mitigation has the strongest evidence base?

Differential privacy for training-data privacy attacks has the strongest formal-guarantee story. Adversarial training raises the bar for evasion at known cost in clean accuracy. Certified defences offer provable bounds for small perturbations. For Abuse, the picture is less settled and the empirical literature evolves quarterly. The honest engineering answer in 2026 is layered defence with per-decision evidence.

Does Warrant produce per-decision adversarial-check evidence?

Yes. Every action in the trace carries a structured adversarial-check record. Evasion, poisoning, privacy and abuse surfaces are populated per action, and each action produces a record mapped to a specific EU AI Act obligation that is independently verifiable without contacting Warrant. The evidence is per-decision, not per-system, which is what Article 15(5) read in conjunction with Article 12 paragraph 2 requires.

How does NIST AI 100-2 interact with ISO/IEC 27090?

ISO/IEC 27090 is the international standard on guidance for addressing security threats to AI systems. It overlaps in scope with NIST AI 100-2 but is structured as guidance rather than a taxonomy. The two reference each other. A defensible 2026 posture cites the NIST taxonomy for vocabulary and ISO/IEC 27090 for controls, with neither displacing the harmonised standards process under EU AI Act Article 40.

11 · READ THE SOURCE

Read the source directly.

Authored by Warrant Engineering, the engineering function at Warrant. [email protected]. Editorial commentary on a published technical taxonomy. Not legal advice. The publication identifier NIST.AI.100-2e2025 and the title Adversarial Machine Learning · A Taxonomy and Terminology of Attacks and Mitigations reflect the NIST CSRC final release of 24 March 2025. Where the taxonomy text is paraphrased or summarised, the canonical source is the PDF linked above.

NIST AI 100-2, line by line.