
Frontier safety frameworks & dangerous-capability evaluations

If II.14 is the threat, this is how the field tries to govern it at the source. A proficient practitioner needs to read these frameworks, because they decide whether a model is too capable to deploy safely, they shape what capabilities adversaries will soon have, and they’re becoming law. The concept - gate scaling on measured capability - was introduced by METR in 2023 and is now standard across the major labs.

The three frameworks (updated 2025-2026)

| Lab | Framework | Threshold concept |
| --- | --- | --- |
| Anthropic | Responsible Scaling Policy (v3.3, current; the v3.0 rewrite of Feb 2026 replaced the hard pre-training pledge with Frontier Safety Roadmaps & Risk Reports; v3.3 refined the chem/bio capability threshold; ASL-3 activated May 2025) | AI Safety Levels (ASL) / Capability Thresholds |
| OpenAI | Preparedness Framework (v2; Apr 2025) | Tracked categories at Low / Medium / High / Critical |
| Google DeepMind | Frontier Safety Framework (v3.1; Apr 2026) | Critical Capability Levels (CCLs) |

They share the same bones: test models for dangerous capabilities during development; if a model approaches a threshold, apply deployment mitigations and secure the model weights against theft; if no sufficient mitigation exists, hold deployment (or, for some, development). They center on the same misuse domains - CBRN / bio-chemical, cyber, and AI self-improvement / R&D - but they are not identical: DeepMind’s FSF added a harmful-manipulation capability level and an explicit misalignment track (models resisting oversight or shutdown) in v3.0, then Tracked Capability Levels for earlier warning in v3.1 (Apr 2026), so misalignment is no longer just an afterthought.

flowchart TB
  EVAL["Dangerous-capability evals<br/>CBRN · cyber · self-improvement"] --> Q{"Approaching a<br/>capability threshold?"}
  Q -->|"No"| DEP["Deploy with standard safeguards"]
  Q -->|"Yes"| MIT{"Sufficient safeguards<br/>available?"}
  MIT -->|"Yes"| DEPS["Deploy + heightened safeguards<br/>+ secure model weights"]
  MIT -->|"No"| HOLD["Hold deployment<br/>(and possibly development)"]
  classDef e fill:#0f1a18,stroke:#5bd1c5,color:#bdeee2;
  classDef g fill:#1d1708,stroke:#e4a23f,color:#f0d8a8;
  classDef r fill:#241310,stroke:#ff5b4d,color:#ffc4bb;
  class EVAL,DEP,DEPS e; class Q,MIT g; class HOLD r;

The “if-then” spine all three share. The disagreements are in where thresholds sit, how strong the commitment is (“will” vs “recommend”), and who can override.
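The two questions in the flowchart can be sketched in Python. This is a minimal illustration, not any lab's actual procedure: it reads "approaching a threshold" as entering a buffer zone below it, and the `Gate` class, the `margin` field, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    DEPLOY_STANDARD = "deploy with standard safeguards"
    DEPLOY_HEIGHTENED = "deploy with heightened safeguards and secured weights"
    HOLD = "hold deployment (and possibly development)"


@dataclass(frozen=True)
class Gate:
    threshold: float  # hypothetical score at which the capability threshold is crossed
    margin: float     # hypothetical early-warning buffer below the threshold

    def approaching(self, score: float) -> bool:
        # "Approaching" is interpreted here as entering the buffer below the threshold.
        return score >= self.threshold - self.margin


def decide(score: float, gate: Gate, safeguards_sufficient: bool) -> Decision:
    # First question: is the model approaching the threshold at all?
    if not gate.approaching(score):
        return Decision.DEPLOY_STANDARD
    # Second question: do sufficient safeguards exist for this capability level?
    if safeguards_sufficient:
        return Decision.DEPLOY_HEIGHTENED
    return Decision.HOLD


gate = Gate(threshold=0.8, margin=0.1)  # illustrative numbers only
print(decide(0.50, gate, safeguards_sufficient=False).name)
print(decide(0.75, gate, safeguards_sufficient=True).name)
print(decide(0.75, gate, safeguards_sufficient=False).name)
```

The buffer makes the gate fire before the threshold itself is reached, which is one way to leave time for safeguards to be built. Where the threshold and margin sit, and who judges whether safeguards are "sufficient", are exactly the points on which the frameworks differ.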

What this looks like in practice (2025-2026)

Capability-threshold gate (deploy / hold decision)
# A frontier-safety framework turns an eval score into a pre-committed release gate.
# Threshold values are illustrative placeholders, not taken from any published framework.
THRESHOLD_HIGH = 0.6      # set in advance, not negotiated after a strong result
THRESHOLD_CRITICAL = 0.8

def release_decision(cyber_uplift: float, cbrn_uplift: float) -> str:
    if cyber_uplift >= THRESHOLD_HIGH or cbrn_uplift >= THRESHOLD_CRITICAL:
        # RSP / FSF "do not deploy until": stronger safeguards + external review
        return "HOLD"
    return "DEPLOY_WITH_MONITORING"
  • Anthropic activated ASL-3 safeguards in May 2025 (input/output classifiers reducing chem/bio misuse) and treats recent models as High on biology. The RSP v3.0 rewrite (Feb 2026) replaced the earlier hard pre-training commitment with Frontier Safety Roadmaps and recurring Risk Reports plus external review. Subsequent minor updates (v3.1, then v3.3) refined the AI-R&D and chemical/biological thresholds; v3.3 is current.
  • OpenAI’s GPT-5.3-Codex (Feb 2026) was the first launch treated as High capability in Cybersecurity, activating the associated safeguards - a concrete threshold crossing in offensive-security capability (ties back to II.9).
  • Evaluation methods: dangerous-capability evals and uplift studies, domain benchmarks (e.g. CVE-Bench for cyber), internal red teams, and third-party evaluators including METR and the UK and US government AI institutes (the UK AI Security Institute and US CAISI, formerly the AI Safety Institutes).
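An uplift study, as mentioned in the last bullet, typically compares how well people complete a task with and without access to the model. A minimal sketch of that arithmetic follows, assuming uplift is measured as a difference in mean task scores. The `uplift` function and all scores are hypothetical, and real studies add sample-size planning, significance testing, and expert grading.

```python
from statistics import mean


def uplift(control_scores: list[float], assisted_scores: list[float]) -> float:
    """Mean task score with model access minus mean score without it."""
    if not control_scores or not assisted_scores:
        raise ValueError("both groups need at least one participant")
    return mean(assisted_scores) - mean(control_scores)


# Made-up scores (fraction of task steps completed), for illustration only.
control = [0.20, 0.30, 0.25, 0.25]   # baseline group, e.g. internet access only
assisted = [0.40, 0.50, 0.45, 0.45]  # same resources plus the model under test

delta = uplift(control, assisted)
print(f"absolute uplift: {delta:.2f}")
```

A number like `delta` is what a pre-committed gate would compare against its threshold. The difficult part in practice is designing the task and the control condition so that the difference reflects the model's contribution.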