Threats to AI Systems
Executive summary: AI systems face a growing set of security, integrity, privacy, and safety threats across their lifecycle.
This short synthesis groups the principal threat categories, maps common attack surfaces, cites prominent prior taxonomies, and gives a one‑page practitioner "quick card" table for immediate operational use.
The goal: concise, accurate grounding for researchers, engineers, and security teams so that work on defenses is prioritized where it matters.
Part 1 — Threat taxonomy (short synthesis)
High‑level categories
Data‑integrity attacks
Data poisoning: an adversary inserts, modifies, or deletes training data so that the learned model behaves incorrectly (e.g., backdoor triggers, label flipping, targeted degradation).
Data inference and privacy attacks: extracting sensitive training data or attributes via membership inference, model inversion.
Model‑targeted attacks
Model stealing/extraction: adversary queries a model (black‑box or white‑box) to reconstruct model parameters, replicate functionality, or infer proprietary architecture.
Model tampering: unauthorized modification of model weights or architecture (trojaning after compromise).
Input‑surface attacks
Adversarial examples: carefully crafted inputs (often small perturbations) that cause misclassification or harmful output at inference time.
Prompt injection (for LLMs and promptable models): maliciously crafted inputs that cause an LLM or chained prompt to reveal secrets, follow unsafe instructions, or bypass policy controls.
Deployment and runtime attacks
Evasion attacks: test‑time inputs that cause model failure without altering the model.
Model extraction via API abuse; resource‑exhaustion attacks (denial of service).
Supply‑chain and third‑party dependencies
Poisoned pretrained models or datasets from third parties; compromised training pipelines, CI/CD, or model registries.
Malicious or vulnerable libraries and tooling that alter model behavior or leak secrets.
Governance, specification, and misuse threats
Specification mismatches, goal misalignment, and insufficient evaluation lead to unsafe generalization or misuse.
Adversarial use: benign models repurposed by attackers for phishing, deepfakes, misinformation, and automation of harmful tasks.
Key properties across categories
Target: data, model, or deployment pipeline.
Stage: training (including pretraining), model distribution, inference/runtime.
Access model: white‑box (full access), black‑box (query only), or side‑channel.
Scope: targeted (single or small set of inputs/users) vs. indiscriminate (broad).
Detectability: overt (easy to detect anomalous training artifacts) vs. covert (well‑concealed backdoors or subtle performance drift).
Part 2 — Typical attack surfaces and concrete examples
Data (training and validation datasets)
Threats: poisoning, Trojan insertion, mislabeled or low‑quality data, and privacy leakage.
Example: A self‑driving car perception model trained on crowdsourced images, in which an adversary injects images with subtle stickers that create a backdoor: stop signs with a specific sticker are misread as speed limit signs.
Why it matters: Most models depend on large datasets; compromised data can yield persistent, hard‑to‑detect faults.
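To make the poisoning mechanism concrete, here is a minimal sketch (with fabricated toy data) of a label‑flipping attack against a nearest‑centroid classifier: flipping a single boundary‑adjacent training label shifts the class centroids and changes predictions on clean inputs.

```python
# Minimal sketch (hypothetical data): label-flipping against a
# nearest-centroid classifier. Flipping one training label shifts
# the class centroids, moving the decision boundary.

def centroids(data):
    """Mean feature value per class label."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda y: abs(cents[y] - x))

clean = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
# Adversary flips the label of one boundary-adjacent point.
poisoned = [(0.0, "a"), (1.0, "b"), (9.0, "b"), (10.0, "b")]

print(predict(centroids(clean), 4.0))     # "a": nearest centroid is 0.5
print(predict(centroids(poisoned), 4.0))  # "b": poisoning moved the boundary
```

Real poisoning attacks target high‑dimensional learners, but the failure mode is the same: a small, targeted change to training data silently relocates the decision boundary.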
Model (weights, architecture, hyperparameters)
Threats: model stealing, tampering, insertion of trojans during fine‑tuning, and extraction of sensitive model internals.
Example: An API‑served image classifier is queried extensively with adaptive probing to reconstruct a surrogate model, which is then used to craft transferable adversarial examples or to remove attribution.
Why it matters: model theft undermines IP rights, enables downstream attacks, and can mask malicious behavior.
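The adaptive‑probing idea can be illustrated with a deliberately simplified sketch (victim model, threshold, and query budget are all hypothetical): the attacker has only query access, yet recovers the decision boundary by binary search and builds a functional surrogate.

```python
# Hypothetical sketch: extracting a black-box threshold classifier
# with query access only, then cloning it as a surrogate.

def black_box(x):
    """Victim model: the attacker observes only its outputs."""
    return 1 if x >= 3.7 else 0

def extract_threshold(model, lo=0.0, hi=10.0, steps=40):
    """Binary-search the decision boundary using ~`steps` queries."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if model(mid) == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

stolen = extract_threshold(black_box)

def surrogate(x):
    """Attacker's clone of the victim's behavior."""
    return 1 if x >= stolen else 0
```

About 40 queries suffice here; real extraction of deep models needs far more queries and surrogate training, but the economics are similar: each query leaks information about the boundary.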
Deployment and inference
Threats: adversarial examples, prompt injection, API abuse, input sanitization failures, and side channels.
Example: An attacker crafts adversarial audio that is unintelligible to humans but causes a voice assistant to execute high‑privilege commands.
Why it matters: attacks here are real‑time, can subvert services, and may bypass offline safeguards.
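A toy evasion example on a linear scorer (weights, input, and budget are illustrative) shows the core mechanic behind adversarial examples: step the input a small amount against the gradient direction, flipping the decision while barely changing the input.

```python
# Toy FGSM-style evasion sketch on a linear scorer: perturb each
# feature by eps against the sign of its weight.

w = [2.0, -1.0]  # hypothetical model weights
x = [0.4, 0.5]   # benign input; score = 0.3 -> positive class

def score(w, x):
    """Linear decision score: positive means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v >= 0 else -1.0

eps = 0.2  # small perturbation budget
adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(score(w, x) > 0)    # True: original classified positive
print(score(w, adv) > 0)  # False: small perturbation flips the decision
```

For deep networks the gradient is computed by backpropagation rather than read off the weights, but the attack surface is identical: inference‑time inputs the defender does not control.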
Supply chain (pretrained models, datasets, libraries, hardware)
Threats: malicious third‑party checkpoints, compromised model registries, poisoned dependency packages.
Example: A widely used open‑source checkpoint includes a hidden trigger that activates when a specific sentence is present, causing a large language model to output misinformation.
Why it matters: supply‑chain compromises scale: one malicious artifact can infect many downstream systems.
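One cheap supply‑chain control is digest pinning: verify a downloaded artifact against a known‑good SHA‑256 digest before loading it. A minimal sketch (artifact bytes and pin are fabricated for illustration):

```python
# Sketch: refuse to load a checkpoint whose SHA-256 digest does not
# match a pinned value recorded from a trusted source.
import hashlib

def verify_checkpoint(data: bytes, pinned_digest: str) -> bool:
    """Return True only if the artifact matches the pinned digest."""
    return hashlib.sha256(data).hexdigest() == pinned_digest

artifact = b"fake-model-weights"
pin = hashlib.sha256(b"fake-model-weights").hexdigest()  # trusted pin

print(verify_checkpoint(artifact, pin))             # True
print(verify_checkpoint(b"tampered-weights", pin))  # False
```

Pinning detects tampering in transit or in a registry; it does not protect against a trigger planted before the pin was recorded, which is why scanning and backdoor testing are complementary.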
Operational and human factors
Threats: weak access controls, poor monitoring, inadequate validation, social engineering, and misconfiguration of prompts and pipelines.
Example: An engineer accidentally exposes API keys in a public repository; attackers use those keys to extract models or run costly inference.
Why it matters: human and process failures often enable technical attacks or amplify their impact.
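Simple automation catches many such process failures before they ship. A toy secret‑scanning sketch (the regex and sample snippet are illustrative, not a production pattern):

```python
# Toy pre-commit secret scan: flag strings that look like hardcoded
# API keys before they reach a public repository.
import re

KEY_PATTERN = re.compile(
    r"(?:api[_-]?key|secret)['\"]?\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]",
    re.IGNORECASE,
)

def find_suspected_keys(text: str):
    """Return candidate secrets found in the given source text."""
    return KEY_PATTERN.findall(text)

snippet = 'API_KEY = "abcd1234efgh5678ijkl"'
print(find_suspected_keys(snippet))  # ['abcd1234efgh5678ijkl']
```

Production scanners use entropy checks and provider‑specific patterns, but even a crude filter in CI raises the bar against the accidental‑exposure scenario above.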
Part 3 — Known taxonomies and prior work (select references)
Biggio & Roli (2018): adversarial machine learning at a systematic level — attacks and defenses in the training and test phases.
Barreno et al. (2006): taxonomy of attacks against learning algorithms (poisoning, exploratory attacks).
Papernot et al. (2016): model extraction and black‑box attacks on DNNs.
Tramèr et al. (2016): model extraction via prediction APIs.
Szegedy et al. (2013) and Goodfellow et al. (2014): discovery and early systematic study of adversarial examples; Kurakin et al. (2016): adversarial examples in the physical world.
Carlini & Wagner (2017): strong optimization‑based attacks for evaluating model robustness; related work covers trojan/backdoor attacks and defenses.
Adversarial ML Threat Matrix (a community effort, continued as MITRE ATLAS): mapping threats to assets and mitigations.
Note: this list is selective; cite the latest literature as you drill down for formal publication.
Part 4 — Risk factors that increase susceptibility
Large, uncurated datasets from unknown sources.
Publicly accessible APIs that allow unlimited or high‑volume probing.
Use of third‑party checkpoints without integrity verification.
Models exposed as black boxes without query throttling or anomaly detection.
Lack of input sanitization or chain‑of‑trust checks for instruction‑based models.
Poor key and secret management, and weak privilege separation between training and deployment environments.
Part 5 — Defenses and mitigation patterns (high level)
Data hygiene: provenance and metadata tracking, dataset versioning, data validation and anomaly detection, and robust training methods (outlier detection, robust estimators).
Access controls: authentication, authorization, rate limiting, usage monitoring, differential pricing for sensitive APIs.
Model hardening: adversarial training, certifiable robustness (where feasible), avoidance of gradient masking; ensembles and randomization as partial mitigations.
Privacy protections: differential privacy for training, secure multiparty computation for collaborative training, private query mechanisms.
Supply‑chain controls: signed checkpoints, reproducible builds, model and dataset scanning, vetted registries.
Runtime monitoring: anomaly detection in inputs and outputs, guardrails for LLMs (policy models, red teaming), logging and alerting, human‑in‑the‑loop for high‑risk decisions.
Testing and verification: continuous evaluation against threat suites (poisoning tests, adversarial example libraries, red team exercises).
Organizational: incident response plans for model compromise, legal/contract controls for third‑party artifacts, and clear ownership of model assets.
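The rate‑limiting and throttling controls above can be sketched as a token bucket (capacity and refill rate are illustrative, not recommendations): each query spends a token, and tokens refill at a fixed rate, so sustained extraction‑style query bursts get refused.

```python
# Minimal token-bucket sketch for per-client API rate limiting.
# Timestamps are passed in explicitly to keep the example deterministic.

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity        # start full
        self.refill = refill_per_sec
        self.last = 0.0               # time of the previous call

    def allow(self, now: float) -> bool:
        """Grant a query if a token is available at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])  # [True, True, False, True]
```

The third request arrives before a token has refilled and is refused; after 1.3 seconds of refill the fourth succeeds. Production limiters add per‑key buckets, persistence, and alerting on sustained refusals.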
Part 6 — Practitioner artifact: AI Threats Quick Cards — One‑page table
Note: Designed to be printable and split into one‑pagers per threat later.
| Threat name | What it is (one line) | Where it hits (training / inference / supply chain) | Simple example | One mitigation idea |
|---|---|---|---|---|
Data poisoning | Adversary injects or modifies training data to cause incorrect behavior | Training (also affects the supply chain) | Backdoor: images with a sticker cause misclassification when the sticker is present | Data provenance + anomaly detection; robust training; hold‑out validation from a trusted source |
Label flipping | Incorrect labels inserted to bias the model | Training | Attacker flips labels of a rare class to cause misprediction | Label auditing, consensus labeling, and active verification on samples |
Model stealing / extraction | Reconstructing model behavior/parameters via queries | Inference (API) | The attacker issues many adaptive queries to clone a classifier | Rate limits, API throttling, output perturbation, watermarking models |
Trojan/backdoor | A hidden trigger causes malicious behavior only when the trigger is present | Training/supply chain | Pretrained checkpoint with a Trojan that activates on a specific token | Checkpoint signing & scanning, test for backdoors via trigger search |
Adversarial examples | Small perturbations to inputs cause wrong outputs | Inference | A slightly modified image causes misclassification | Adversarial training, input preprocessing, detection at runtime |
Prompt injection | Crafted input causes LLM to ignore policies or reveal secrets | Inference (LLMs) | User input asks the system to "ignore instruction" and reveal API keys | Input sanitization, instruction isolation, context filtering, model‑based policy enforcement |
Membership inference | Determine whether a record was in the training set | Inference/privacy | Attacker queries the model and infers that person X's data was used | Differential privacy during training, query access controls |
Model tampering | Direct modification of model weights or files | Supply chain/deployment | Attacker with access replaces the model binary with a compromised version | Binary signing, integrity checks, access control, and immutable registries |
Supply‑chain compromise | Malicious or vulnerable third‑party artifact | Supply chain | Dependency package with exfiltration code included in training pipeline | Vetting dependencies, SBOMs, reproducible builds, isolated training environments |
Evasion / runtime misuse | Inputs crafted to circumvent detection or cause harmful automation | Inference | Chatbot coaxed to draft a phishing email | Runtime filters, human review for critical outputs, policy models |
Side‑channel leakage | Information extracted via timing, memory, or power channels | Inference/deployment | Attackers infer secret tokens by measuring response time differences | Constant‑time implementations, noise injection, hardware mitigations |
Inference denial/resource abuse | Overloading the service or using resources to exfiltrate the model | Deployment | The multi-user API is used to run costly extraction or denial attacks | Quotas, billing controls, anomaly detection, circuit breakers |
Privacy attacks (model inversion) | Reconstruct inputs or sensitive attributes from model outputs | Inference | Recover the approximate face from the feature vectors | Limit output detail, differential privacy, and access controls |
Misuse (dual‑use) | Benign model used for harmful tasks by adversaries | Deployment/governance | Image generator used to create realistic fraudulent IDs | Use policies, monitoring downloads, and watermarking outputs |
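As a concrete illustration of the membership‑inference row above, a toy confidence‑thresholding attacker (the confidences and threshold are fabricated for illustration): overfit models tend to be more confident on memorized training records, so the attacker guesses "member" whenever confidence is high.

```python
# Toy membership-inference sketch: guess "member" when the model's
# confidence on a record exceeds a threshold. Confidences are made up.

def infer_member(confidence: float, threshold: float = 0.95) -> bool:
    """Guess whether a record was in the training set."""
    return confidence >= threshold

train_record_conf = 0.99   # model very confident on a memorized record
unseen_record_conf = 0.62  # lower confidence on an unseen record

print(infer_member(train_record_conf))   # True  -> guessed member
print(infer_member(unseen_record_conf))  # False -> guessed non-member
```

Stronger attacks calibrate per‑example thresholds with shadow models, but the defense column in the table applies either way: differential privacy and query controls shrink the confidence gap the attacker exploits.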
Part 7 — Practical checklist for teams (short)
Inventory and classify assets: identify models, datasets, endpoints, keys, and collaborators.
Apply least privilege and separate training and inference environments.
Enforce dataset provenance and version control; scan datasets for anomalies.
Apply API protections: authentication, rate limits, logging, and anomaly detection.
Vet third‑party models/datasets; prefer signed artifacts and reproducible builds.
Perform adversarial and poisoning tests as part of CI/CD for models.
Add human review for sensitive outputs and deploy incremental rollouts.
Maintain an incident response playbook for model compromise and leakage.
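The anomaly‑detection items in the checklist can be as simple as a robust outlier rule on per‑client query volumes. A sketch using a median‑absolute‑deviation test (the counts and factor are illustrative):

```python
# Sketch: flag clients whose query volume is an outlier relative to
# the fleet median, using a median-absolute-deviation (MAD) rule.
import statistics

def anomalous_clients(counts: dict, factor: float = 5.0):
    """Return clients whose query count sits far above the fleet median."""
    values = list(counts.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread: nothing stands out
    return [c for c, v in counts.items() if (v - med) / mad > factor]

counts = {"app-1": 120, "app-2": 125, "app-3": 130, "scraper": 9000}
print(anomalous_clients(counts))  # ['scraper']
```

A median‑based rule is used here because a mean/standard‑deviation z‑score is itself dragged up by the outlier it is meant to catch; flagged clients would then feed the quotas and circuit breakers listed above.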
Conclusion: Threats to AI systems are diverse and span data, model internals, runtime inputs, and the supply chain.
Practical defenses combine technical measures (robust training, differential privacy, signing), operational controls (access control, monitoring, incident response), and governance (supply‑chain vetting, red teaming).
Treat AI assets like other critical software: identify high‑risk points, continuously test and monitor, and build layered defenses.