Threats to AI Systems
Executive summary: AI systems face a growing set of security, integrity, privacy, and safety threats across their lifecycle.
This short synthesis groups the principal threat categories, maps common attack surfaces, cites prominent prior taxonomies, and gives a one‑page practitioner "quick card" table for immediate operational use.
The goal: concise, accurate grounding for researchers, engineers, and security teams so that work on defenses is prioritized where it matters.
Part 1 — Threat taxonomy (short synthesis)
High‑level categories
Data‑integrity attacks
Data poisoning: an adversary inserts, modifies, or deletes training data so that the learned model behaves incorrectly (e.g., backdoor triggers, label flipping, targeted degradation).
Data inference and privacy attacks: extracting sensitive training data or attributes via membership inference, model inversion.
Model‑targeted attacks
Model stealing/extraction: adversary queries a model (black‑box or white‑box) to reconstruct model parameters, replicate functionality, or infer proprietary architecture.
Model tampering: unauthorized modification of model weights or architecture (trojaning after compromise).
Input‑surface attacks
Adversarial examples: carefully crafted inputs (often small perturbations) that cause misclassification or harmful output at inference time.
Prompt injection (for LLMs and promptable models): maliciously crafted inputs that cause an LLM or chained prompt to reveal secrets, follow unsafe instructions, or bypass policy controls.
Deployment and runtime attacks
Evasion attacks: test‑time inputs that cause model failure without altering the model.
Model extraction via API abuse; resource‑exhaustion attacks (denial of service).
Supply‑chain and third‑party dependencies
Poisoned pretrained models or datasets from third parties; compromised training pipelines, CI/CD, or model registries.
Malicious or vulnerable libraries and tooling that alter model behavior or leak secrets.
Governance, specification, and misuse threats
Specification mismatches, goal misalignment, and insufficient evaluation lead to unsafe generalization or misuse.
Adversarial use: benign models repurposed by attackers for phishing, deepfakes, misinformation, and automation of harmful tasks.
Key properties across categories
Target: data, model, or deployment pipeline.
Stage: training (including pretraining), model distribution, inference/runtime.
Access model: white‑box (full access), black‑box (query only), or side‑channel.
Scope: targeted (single or small set of inputs/users) vs. indiscriminate (broad).
Detectability: overt (easy to detect anomalous training artifacts) vs. covert (well‑concealed backdoors or subtle performance drift).
Part 2 — Typical attack surfaces and concrete examples
Data (training and validation datasets)
Threats: poisoning, Trojan insertion, mislabeled or low‑quality data, and privacy leakage.
Example: A self‑driving car perception model trained on crowdsourced images, in which an adversary injects images with subtle stickers that create a backdoor: stop signs with a specific sticker are misread as speed limit signs.
Why it matters: Most models depend on large datasets; compromised data can yield persistent, hard‑to‑detect faults.
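To make the poisoning mechanism concrete, here is a minimal sketch (with fabricated toy data) of a label‑flipping attack against a nearest‑centroid classifier: flipping a single boundary‑adjacent training label shifts the class centroids and changes predictions on clean inputs.

```python
# Minimal sketch (hypothetical data): label-flipping against a
# nearest-centroid classifier. Flipping one training label shifts
# the class centroids, moving the decision boundary.

def centroids(data):
    """Mean feature value per class label."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda y: abs(cents[y] - x))

clean = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
# Adversary flips the label of one boundary-adjacent point.
poisoned = [(0.0, "a"), (1.0, "b"), (9.0, "b"), (10.0, "b")]

print(predict(centroids(clean), 4.0))     # "a": nearest centroid is 0.5
print(predict(centroids(poisoned), 4.0))  # "b": poisoning moved the boundary
```

Real poisoning attacks target high‑dimensional learners, but the failure mode is the same: a small, targeted change to training data silently relocates the decision boundary.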
Model (weights, architecture, hyperparameters)
Threats: model stealing, tampering, insertion of trojans during fine‑tuning, and extraction of sensitive model internals.
Example: An API‑served image classifier is queried extensively with adaptive probing to reconstruct a surrogate model, which is then used to craft transferable adversarial examples or to remove attribution.
Why it matters: model theft undermines IP rights, enables downstream attacks, and can mask malicious behavior.
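The adaptive‑probing idea can be illustrated with a deliberately simplified sketch (victim model, threshold, and query budget are all hypothetical): the attacker has only query access, yet recovers the decision boundary by binary search and builds a functional surrogate.

```python
# Hypothetical sketch: extracting a black-box threshold classifier
# with query access only, then cloning it as a surrogate.

def black_box(x):
    """Victim model: the attacker observes only its outputs."""
    return 1 if x >= 3.7 else 0

def extract_threshold(model, lo=0.0, hi=10.0, steps=40):
    """Binary-search the decision boundary using ~`steps` queries."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if model(mid) == 1:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

stolen = extract_threshold(black_box)

def surrogate(x):
    """Attacker's clone of the victim's behavior."""
    return 1 if x >= stolen else 0
```

About 40 queries suffice here; real extraction of deep models needs far more queries and surrogate training, but the economics are similar: each query leaks information about the boundary.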
Deployment and inference
Threats: adversarial examples, prompt injection, API abuse, input sanitization failures, and side channels.
Example: An attacker crafts adversarial audio that is unintelligible to humans but causes a voice assistant to execute high‑privilege commands.
Why it matters: attacks here are real‑time, can subvert services, and may bypass offline safeguards.
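A toy evasion example on a linear scorer (weights, input, and budget are illustrative) shows the core mechanic behind adversarial examples: step the input a small amount against the gradient direction, flipping the decision while barely changing the input.

```python
# Toy FGSM-style evasion sketch on a linear scorer: perturb each
# feature by eps against the sign of its weight.

w = [2.0, -1.0]  # hypothetical model weights
x = [0.4, 0.5]   # benign input; score = 0.3 -> positive class

def score(w, x):
    """Linear decision score: positive means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v >= 0 else -1.0

eps = 0.2  # small perturbation budget
adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(score(w, x) > 0)    # True: original classified positive
print(score(w, adv) > 0)  # False: small perturbation flips the decision
```

For deep networks the gradient is computed by backpropagation rather than read off the weights, but the attack surface is identical: inference‑time inputs the defender does not control.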
Supply chain (pretrained models, datasets, libraries, hardware)
Threats: malicious third‑party checkpoints, compromised model registries, poisoned dependency packages.
Example: A widely used open‑source checkpoint includes a hidden trigger that activates when a specific sentence is present, causing a large language model to output misinformation.
Why it matters: supply‑chain compromises scale: one malicious artifact can infect many downstream systems.
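One cheap supply‑chain control is digest pinning: verify a downloaded artifact against a known‑good SHA‑256 digest before loading it. A minimal sketch (artifact bytes and pin are fabricated for illustration):

```python
# Sketch: refuse to load a checkpoint whose SHA-256 digest does not
# match a pinned value recorded from a trusted source.
import hashlib

def verify_checkpoint(data: bytes, pinned_digest: str) -> bool:
    """Return True only if the artifact matches the pinned digest."""
    return hashlib.sha256(data).hexdigest() == pinned_digest

artifact = b"fake-model-weights"
pin = hashlib.sha256(b"fake-model-weights").hexdigest()  # trusted pin

print(verify_checkpoint(artifact, pin))             # True
print(verify_checkpoint(b"tampered-weights", pin))  # False
```

Pinning detects tampering in transit or in a registry; it does not protect against a trigger planted before the pin was recorded, which is why scanning and backdoor testing are complementary.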
Operational and human factors
Threats: weak access controls, poor monitoring, inadequate validation, social engineering, and misconfiguration of prompts and pipelines.
Example: An engineer accidentally exposes API keys in a public repository; attackers use those keys to extract models or run costly inference.
Why it matters: human and process failures often enable technical attacks or amplify their impact.
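Simple automation catches many such process failures before they ship. A toy secret‑scanning sketch (the regex and sample snippet are illustrative, not a production pattern):

```python
# Toy pre-commit secret scan: flag strings that look like hardcoded
# API keys before they reach a public repository.
import re

KEY_PATTERN = re.compile(
    r"(?:api[_-]?key|secret)['\"]?\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]",
    re.IGNORECASE,
)

def find_suspected_keys(text: str):
    """Return candidate secrets found in the given source text."""
    return KEY_PATTERN.findall(text)

snippet = 'API_KEY = "abcd1234efgh5678ijkl"'
print(find_suspected_keys(snippet))  # ['abcd1234efgh5678ijkl']
```

Production scanners use entropy checks and provider‑specific patterns, but even a crude filter in CI raises the bar against the accidental‑exposure scenario above.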
Part 3 — Known taxonomies and prior work (select references)
Biggio & Roli (2018): adversarial machine learning at a systematic level — attacks and defenses in the training and test phases.
Barreno et al. (2006): taxonomy of attacks against learning algorithms (poisoning, exploratory attacks).
Papernot et al. (2016): model extraction and black‑box attacks on DNNs.
Tramèr et al. (2016): model extraction via prediction APIs.
Szegedy et al. (2013) and Goodfellow et al. (2014): discovery and early systematic study of adversarial examples; Kurakin et al. (2016): adversarial examples in the physical world.
Carlini & Wagner (2017): strong optimization‑based attacks for evaluating model robustness; related work covers trojan/backdoor attacks and defenses.
Adversarial ML Threat Matrix (a community effort, continued as MITRE ATLAS): mapping threats to assets and mitigations.
Note: this list is selective; cite the latest literature as you drill down for formal publication.
Part 4 — Risk factors that increase susceptibility
Large, uncurated datasets from unknown sources.
Publicly accessible APIs that allow unlimited or high‑volume probing.
Use of third‑party checkpoints without integrity verification.
Models exposed as black boxes without query throttling or anomaly detection.
Lack of input sanitization or chain‑of‑trust checks for instruction‑based models.
Poor key and secret management, and weak privilege separation between training and deployment environments.
Part 5 — Defenses and mitigation patterns (high level)
Data hygiene: provenance and metadata tracking, dataset versioning, data validation and anomaly detection, and robust training methods (outlier detection, robust estimators).
Access controls: authentication, authorization, rate limiting, usage monitoring, differential pricing for sensitive APIs.
Model hardening: adversarial training, certifiable robustness (where feasible), avoidance of gradient masking; ensembles and randomization as partial mitigations.
Privacy protections: differential privacy for training, secure multiparty computation for collaborative training, private query mechanisms.
Supply‑chain controls: signed checkpoints, reproducible builds, model and dataset scanning, vetted registries.
Runtime monitoring: anomaly detection in inputs and outputs, guardrails for LLMs (policy models, red teaming), logging and alerting, human‑in‑the‑loop for high‑risk decisions.
Testing and verification: continuous evaluation against threat suites (poisoning tests, adversarial example libraries, red team exercises).
Organizational: incident response plans for model compromise, legal/contract controls for third‑party artifacts, and clear ownership of model assets.
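The rate‑limiting and throttling controls above can be sketched as a token bucket (capacity and refill rate are illustrative, not recommendations): each query spends a token, and tokens refill at a fixed rate, so sustained extraction‑style query bursts get refused.

```python
# Minimal token-bucket sketch for per-client API rate limiting.
# Timestamps are passed in explicitly to keep the example deterministic.

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity        # start full
        self.refill = refill_per_sec
        self.last = 0.0               # time of the previous call

    def allow(self, now: float) -> bool:
        """Grant a query if a token is available at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])  # [True, True, False, True]
```

The third request arrives before a token has refilled and is refused; after 1.3 seconds of refill the fourth succeeds. Production limiters add per‑key buckets, persistence, and alerting on sustained refusals.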
Part 6 — Practitioner artifact: AI Threats Quick Cards — One‑page table
Note: Designed to be printable and split into one‑pagers per threat later.
| Threat name | What it is (one line) | Where it hits (training / inference / supply chain) | Simple example | One mitigation idea |
|---|---|---|---|---|
Data poisoning | Adversary injects or modifies training data to cause incorrect behavior | Training (also affects the supply chain) | Backdoor: images with a sticker cause misclassification when the sticker is present | Data provenance + anomaly detection; robust training; hold‑out validation from a trusted source |
Label flipping | Incorrect labels inserted to bias the model | Training | Attacker flips labels of a rare class to cause misprediction | Label auditing, consensus labeling, and active verification on samples |
Model stealing / extraction | Reconstructing model behavior/parameters via queries | Inference (API) | The attacker issues many adaptive queries to clone a classifier | Rate limits, API throttling, output perturbation, watermarking models |
Trojan/backdoor | A hidden trigger causes malicious behavior only when the trigger is present | Training/supply chain | Pretrained checkpoint with a Trojan that activates on a specific token | Checkpoint signing & scanning, test for backdoors via trigger search |
Adversarial examples | Small perturbations to inputs cause wrong outputs | Inference | A slightly modified image causes misclassification | Adversarial training, input preprocessing, detection at runtime |
Prompt injection | Crafted input causes LLM to ignore policies or reveal secrets | Inference (LLMs) | User input asks the system to "ignore instruction" and reveal API keys | Input sanitization, instruction isolation, context filtering, model‑based policy enforcement |
Membership inference | Determine whether a record was in the training set | Inference/privacy | Attacker queries the model and infers that person X's data was used | Differential privacy during training, query access controls |
Model tampering | Direct modification of model weights or files | Supply chain/deployment | Attacker with access replaces the model binary with a compromised version | Binary signing, integrity checks, access control, and immutable registries |
Supply‑chain compromise | Malicious or vulnerable third‑party artifact | Supply chain | Dependency package with exfiltration code included in training pipeline | Vetting dependencies, SBOMs, reproducible builds, isolated training environments |
Evasion / runtime misuse | Inputs crafted to circumvent detection or cause harmful automation | Inference | Chatbot coaxed to draft a phishing email | Runtime filters, human review for critical outputs, policy models |
Side‑channel leakage | Information extracted via timing, memory, or power channels | Inference/deployment | Attackers infer secret tokens by measuring response time differences | Constant‑time implementations, noise injection, hardware mitigations |
Inference denial/resource abuse | Overloading the service or using resources to exfiltrate the model | Deployment | The multi-user API is used to run costly extraction or denial attacks | Quotas, billing controls, anomaly detection, circuit breakers |
Privacy attacks (model inversion) | Reconstruct inputs or sensitive attributes from model outputs | Inference | Recover the approximate face from the feature vectors | Limit output detail, differential privacy, and access controls |
Misuse (dual‑use) | Benign model used for harmful tasks by adversaries | Deployment/governance | Image generator used to create realistic fraudulent IDs | Use policies, monitoring downloads, and watermarking outputs |
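As a concrete illustration of the membership‑inference row above, a toy confidence‑thresholding attacker (the confidences and threshold are fabricated for illustration): overfit models tend to be more confident on memorized training records, so the attacker guesses "member" whenever confidence is high.

```python
# Toy membership-inference sketch: guess "member" when the model's
# confidence on a record exceeds a threshold. Confidences are made up.

def infer_member(confidence: float, threshold: float = 0.95) -> bool:
    """Guess whether a record was in the training set."""
    return confidence >= threshold

train_record_conf = 0.99   # model very confident on a memorized record
unseen_record_conf = 0.62  # lower confidence on an unseen record

print(infer_member(train_record_conf))   # True  -> guessed member
print(infer_member(unseen_record_conf))  # False -> guessed non-member
```

Stronger attacks calibrate per‑example thresholds with shadow models, but the defense column in the table applies either way: differential privacy and query controls shrink the confidence gap the attacker exploits.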
Part 7 — Practical checklist for teams (short)
Inventory and classify assets: identify models, datasets, endpoints, keys, and collaborators.
Apply least privilege and separate training and inference environments.
Enforce dataset provenance and version control; scan datasets for anomalies.
Apply API protections: authentication, rate limits, logging, and anomaly detection.
Vet third‑party models/datasets; prefer signed artifacts and reproducible builds.
Perform adversarial and poisoning tests as part of CI/CD for models.
Add human review for sensitive outputs and deploy incremental rollouts.
Maintain an incident response playbook for model compromise and leakage.
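The anomaly‑detection items in the checklist can be as simple as a robust outlier rule on per‑client query volumes. A sketch using a median‑absolute‑deviation test (the counts and factor are illustrative):

```python
# Sketch: flag clients whose query volume is an outlier relative to
# the fleet median, using a median-absolute-deviation (MAD) rule.
import statistics

def anomalous_clients(counts: dict, factor: float = 5.0):
    """Return clients whose query count sits far above the fleet median."""
    values = list(counts.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread: nothing stands out
    return [c for c, v in counts.items() if (v - med) / mad > factor]

counts = {"app-1": 120, "app-2": 125, "app-3": 130, "scraper": 9000}
print(anomalous_clients(counts))  # ['scraper']
```

A median‑based rule is used here because a mean/standard‑deviation z‑score is itself dragged up by the outlier it is meant to catch; flagged clients would then feed the quotas and circuit breakers listed above.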
Conclusion: Threats to AI systems are diverse and span data, model internals, runtime inputs, and the supply chain.
Practical defenses combine technical measures (robust training, differential privacy, signing), operational controls (access control, monitoring, incident response), and governance (supply‑chain vetting, red teaming).
Treat AI assets like other critical software: identify high‑risk points, continuously test and monitor, and build layered defenses.