Threats to AI Systems

Executive summary: AI systems face a growing set of security, integrity, privacy, and safety threats across their lifecycle.

This short synthesis groups the principal threat categories, maps common attack surfaces, cites prominent prior taxonomies, and gives a one‑page practitioner” quick card” table for immediate operational use.

The goal: concise, accurate grounding for researchers, engineers, and security teams so that work on defenses is prioritized where it matters.

Part 1 — Threat taxonomy (short synthesis)

High‑level categories

  • Data‑integrity attacks

    • Data poisoning: an adversary inserts, modifies, or deletes training data so that the learned model behaves incorrectly (e.g., backdoor triggers, label flipping, targeted degradation).

    • Data inference and privacy attacks: extracting sensitive training data or attributes via membership inference, model inversion.

  • Model‑targeted attacks

    • Model stealing/extraction: adversary queries a model (black‑box or white‑box) to reconstruct model parameters, replicate functionality, or infer proprietary architecture.

    • Model tampering: unauthorized modification of model weights or architecture (trojaning after compromise).

  • Input‑surface attacks

    • Adversarial examples: carefully crafted inputs (often small perturbations) that cause misclassification or harmful output at inference time.

    • Prompt injection (for LLMs and promptable models): maliciously crafted inputs that cause an LLM or chained prompt to reveal secrets, follow unsafe instructions, or bypass policy controls.

  • Deployment and runtime attacks

    • Evasion attacks: test‑time inputs that cause model failure without altering the model.

    • Model extraction via API abuse, resource‑use attacks (denial of service).

  • Supply‑chain and third‑party dependencies

    • Poisoned pretrained models or datasets from third parties; compromised training pipelines, CI/CD, or model registries.

    • Malicious or vulnerable libraries and tooling that alter model behavior or leak secrets.

  • Governance, specification, and misuse threats

    • Specification mismatches, goal misalignment, and insufficient evaluation lead to unsafe generalization or misuse.

    • Adversarial use: benign models repurposed by attackers for phishing, deepfakes, misinformation, and automation of harmful tasks.

Key properties across categories

  • Target: data, model, or deployment pipeline.

  • Stage: training (including pretraining), model distribution, inference/runtime.

  • Access model: white‑box (full access), black‑box (query only), or side‑channel.

  • Scope: targeted (single or small set of inputs/users) vs. indiscriminate (broad).

  • Detectability: overt (easy to detect anomalous training artifacts) vs. covert (well‑concealed backdoors or subtle performance drift).

Part 2 — Typical attack surfaces and concrete examples

  1. Data (training and validation datasets)

  • Threats: poisoning, Trojan insertion, mislabeled or low‑quality data, and privacy leakage.

  • Example: A self‑driving car perception model trained on crowdsourced images, in which an adversary injects images with subtle stickers that create a backdoor: stop signs with a specific sticker are misread as speed limit signs.

  • Why it matters: Most models depend on large datasets; compromised data can yield persistent, hard‑to‑detect faults.

  1. Model (weights, architecture, hyperparameters)

  • Threats: model stealing, tampering, insertion of trojans during fine‑tuning, and extraction of sensitive model internals.

  • Example: An API‑served image classifier is queried extensively with adaptive probing to reconstruct a surrogate model, which is then used to craft transferable adversarial examples or to remove attribution.

  • Why it matters: model theft undermines IP rights, enables downstream attacks, and can mask malicious behavior.

  1. Deployment and inference

  • Threats: adversarial examples, prompt injection, API abuse, input sanitization failures, and side channels.

  • Example: An attacker crafts adversarial audio that is unintelligible to humans but causes a voice assistant to execute high‑privilege commands.

  • Why it matters: attacks here are real‑time, can subvert services, and may bypass offline safeguards.

  1. Supply chain (pretrained models, datasets, libraries, hardware)

  • Threats: malicious third‑party checkpoints, compromised model registries, poisoned dependency packages.

  • Example: A widely used open‑source checkpoint includes a hidden trigger that activates when a specific sentence is present, causing a large language model to output misinformation.

  • Why it matters: supply‑chain compromises scale: one malicious artifact can infect many downstream systems.

  1. Operational and human factors

  • Threats: weak access controls, poor monitoring, inadequate validation, social engineering, and misconfiguration of prompts and pipelines.

  • Example: An engineer accidentally exposes API keys in a public repository; attackers use those keys to extract models or run costly inference.

  • Why it matters: human and process failures often enable technical attacks or amplify their impact.

Part 3 — Known taxonomies and prior work (select references)

  • Biggio & Roli (2018): adversarial machine learning at a systematic level — attacks and defenses in the training and test phases.

  • Barreno et al. (2006): taxonomy of attacks against learning algorithms (poisoning, exploratory attacks).

  • Papernot et al. (2016): model extraction and black‑box attacks on DNNs.

  • Tramèr et al. (2016): model extraction via prediction APIs.

  • Kurakin, Goodfellow et al., and Szegedy et al.: origins and systematic study of adversarial examples.

  • Carlini & Wagner (2018) and related work: attacks and defenses for models and trojans.

  • Adversarial ML Threat Matrix (community efforts): mapping threats to assets and mitigations. Note: This list is selective; cite the latest literature as you drill down for formal publication.

Part 4 — Risk factors that increase susceptibility

  • Large, uncurated datasets from unknown sources.

  • Publicly accessible APIs that allow unlimited probing or a large number of queries.

  • Use of third‑party checkpoints without integrity verification.

  • Models are exposed as black boxes without query throttling or anomaly detection.

  • Lack of input sanitization or chain‑of‑trust checks for instruction‑based models.

  • Poor key and secret management, and weak privilege separation between training and deployment environments.

Part 5 — Defenses and mitigation patterns (high level)

  • Data hygiene: provenance tracking, dataset versioning, provenance and metadata, data validation and anomaly detection, robust training methods (outlier detection, robust estimators).

  • Access controls: authentication, authorization, rate limiting, usage monitoring, differential pricing for sensitive APIs.

  • Model hardening: adversarial training, certifiable robustness (where feasible), gradient masking avoided; ensemble methods and randomization as partial mitigations.

  • Privacy protections: differential privacy for training, secure multiparty computing for collaborative training, private query mechanisms.

  • Supply‑chain controls: signed checkpoints, reproducible builds, model and dataset scanning, vetted registries.

  • Runtime monitoring: anomaly detection in inputs and outputs, guardrails for LLMs (policy models, red teaming), logging and alerting, human‑in‑the‑loop for high‑risk decisions.

  • Testing and verification: continuous evaluation against threat suites (poisoning tests, adversarial example libraries, red team exercises).

  • Organizational: incident response plans for model compromise, legal/contract controls for third‑party artifacts, and clear ownership of model assets.

Part 6 — Practitioner artifact: AI Threats Quick Cards — One‑page table

Note: Designed to be printable and split into one‑pagers per threat later.

Threat name

What it is (one line)

Where it hits (training/inference / supply chain)

Simple example

One mitigation idea

Data poisoning

Adversary injects or modifies training data to cause incorrect behavior

Training (also affects the supply chain)

Backdoor: images with a sticker cause misclassification when the sticker is present

Data provenance + anomaly detection; robust training; hold‑out validation from a trusted source

Label flipping

Incorrect labels were inserted to bias the model

Training

Attack flips the labels of the rare class to cause misprediction

Label auditing, consensus labeling, and active verification on samples

Model stealing / extraction

Reconstructing model behavior/parameters via queries

Inference (API)

The attacker issues many adaptive queries to clone a classifier

Rate limits, API throttling, output perturbation, watermarking models

Trojan/backdoor

The hidden trigger causes malicious behavior only when the trigger is present

Training/supply chain

Pretrained checkpoint with Trojan that activates on a specific token

Checkpoint signing & scanning, test for backdoors via trigger search

Adversarial examples

Small perturbations to inputs cause wrong outputs

Inference

A slightly modified image causes misclassification

Adversarial training, input preprocessing, detection at runtime

Prompt injection

Crafted input causes LLM to ignore policies or reveal secrets

Inference (LLMs)

User input asks the system to "ignore instruction" and reveal API keys

Input sanitization, instruction isolation, context filtering, model‑based policy enforcement

Membership inference

Determine whether a record was in the training set

Inference/privacy

Attacker queries the model and infers pthat personX'sdata was used

Differential privacy during training, query access controls

Model tampering

Direct modification of model weights or files

Supply chain/deployment

Attacker with access replaces the model binary with a compromised version

Binary signing, integrity checks, access control, and immutable registries

Supply‑chain compromise

Malicious or vulnerable third‑party artifact

Supply chain

Dependency package with exfiltration code included in training pipeline

Vetting dependencies, SBOMs, reproducible builds, isolated training environments

Evasion / runtime misuse

Inputs crafted to circumvent detection or cause harmful automation

Inference

Chatbot coaxed to draft a phishing email

Runtime filters, human review for critical outputs, policy models

Side‑channel leakage

Information extracted via timing, memory, or power channels

Inference/deployment

Attackers infer secret tokens by measuring response time differences

Constant‑time implementations, noise injection, hardware mitigations

Inference denial/resource abuse

Overloading the service or using resources to exfiltrate the model

Deployment

The multi-user API is used to run costly extraction or denial attacks

Quotas, billing controls, anomaly detection, circuit breakers

Privacy attacks (model inversion)

Reconstruct inputs or sensitive attributes from model outputs

Inference

Recover the approximate face from the feature vectors

Limit output detail, differential privacy, and access controls

Misuse (dual‑use)

Benign model used for harmful tasks by adversaries

Deployment/governance

Image generator used to create realistic fraudulent IDs

Use policies, monitoring downloads, and watermarking outputs

Part 7 — Practical checklist for teams (short)

  • Classify assets by identifying models, datasets, endpoints, keys, and collaborators.

  • Apply least privilege and separate training and inference. environments

  • Enforce dataset provenance and version control; scan datasets for anomalies.

  • Apply API protections: authentication, rate limits, logging, and anomaly detection.

  • Vet third‑party models/datasets; prefer signed artifacts and reproducible builds.

  • Perform adversarial and poisoning tests as part of CI/CD for models.

  • Add human review for sensitive outputs and deploy incremental rollouts.

  • Maintain an incident response playbook for model compromise and leakage.

Conclusion: Threats to AI systems are diverse and span data, model internals, runtime inputs, and the supply chain.

Practical defenses combine technical measures (robust training, differential privacy, signing), operational controls (access control, monitoring, incident response), and governance (supply‑chain vetting, red teaming).

Treat AI assets like other critical software: identify high‑risk points, continuously test and monitor, and build layered defenses.


Was this article helpful?
© 2026 AI Governance & Security Research Hub