Safety·Anthropic·May 2025

54. Activating ASL-3 Protections

First-ever activation of AI Safety Level 3 protections triggered by Claude Opus 4's capabilities

Policy
Summary

Claude Opus 4 became the first AI model to trigger ASL-3 protections under Anthropic's Responsible Scaling Policy. ASL-3 requires enhanced security measures (model weight protection, insider threat defenses), deployment safeguards (monitoring for misuse patterns), and external review. The activation represented the RSP working as designed: a model crossed a capability threshold and the corresponding safety measures kicked in. Anthropic published a detailed activation report explaining which evaluations triggered the level, what protections were implemented, and how ongoing monitoring works.

Key Concepts

ASL-3 triggered by crossing capability thresholds in autonomous AI R&D and CBRN research

Claude Opus 4 demonstrated capabilities in autonomous programming, iterative self-improvement, and biological/chemical knowledge that crossed Anthropic's predefined thresholds. ASL-3 was not activated by accident or out of bureaucratic caution; it was triggered because the model genuinely possessed concerning capabilities. The specific evaluations that triggered ASL-3 included the ability to conduct novel AI research without human oversight, the ability to design biological weapons, the ability to coordinate multi-step cyber attacks, and evidence of potential delusional reasoning under adversarial prompting.

Enhanced security measures: model weight protection and insider threat defenses

Under ASL-3, Anthropic implemented cryptographic protections on model weights (making unauthorized access or modification detectable), mandatory code reviews for all changes to the model architecture, access to weights restricted to identified personnel with explicit authorization, mandatory background checks and security clearances for anyone with model access, and continuous monitoring for anomalous access patterns. These measures were designed to prevent rogue actors inside Anthropic from exfiltrating or weaponizing the model.
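The core idea behind cryptographic weight protection, that any unauthorized modification becomes detectable, can be sketched with a keyed integrity tag. This is a minimal illustration, not Anthropic's actual implementation; the key handling, function names, and byte blob are all hypothetical.

```python
import hashlib
import hmac

def tag_weights(weights: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a serialized weight blob.

    Only infrastructure holding `key` can produce a valid tag, so an
    attacker who silently edits the weights cannot forge a matching tag.
    """
    return hmac.new(key, weights, hashlib.sha256).hexdigest()

def weights_untampered(weights: bytes, key: bytes, expected_tag: str) -> bool:
    """Verify the tag with a constant-time comparison."""
    return hmac.compare_digest(tag_weights(weights, key), expected_tag)

# Illustrative values only.
key = b"example-signing-key"   # in practice, held by authorized systems only
blob = b"\x00\x01\x02\x03"     # stands in for a serialized model checkpoint
tag = tag_weights(blob, key)

print(weights_untampered(blob, key, tag))            # unmodified checkpoint
print(weights_untampered(blob + b"\xff", key, tag))  # tampered checkpoint
```

A scheme like this makes modification *detectable*, which is distinct from the access-control measures (authorization, clearances, anomaly monitoring) that make modification *difficult* in the first place.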

Deployment safeguards and external review: monitoring for misuse patterns and third-party auditing

ASL-3 required real-time monitoring of all Claude Opus 4 usage for patterns consistent with CBRN research, autonomous AI development, cyber attack preparation, or deception/misalignment. Any flagged usage was routed to Anthropic's Governance team for investigation. Additionally, external security firms were engaged to conduct red-team evaluations against ASL-3 protections on a quarterly basis, with results shared with relevant regulators.
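The routing logic described above, flag usage matching misuse categories and send it to a review queue, can be sketched in a few lines. The category names and the trivial keyword matcher below are illustrative stand-ins for whatever real classifiers Anthropic uses; none of this is drawn from the actual system.

```python
from collections import deque

# Hypothetical misuse categories and trigger phrases (illustrative only).
MISUSE_PATTERNS = {
    "cbrn": ["synthesize pathogen", "nerve agent precursor"],
    "cyber": ["multi-step exploit chain", "privilege escalation payload"],
}

# Flagged requests would be routed to a human review team.
review_queue: deque = deque()

def route_request(request_id: str, text: str) -> list[str]:
    """Return matched misuse categories; enqueue the request if any match."""
    lowered = text.lower()
    hits = [cat for cat, phrases in MISUSE_PATTERNS.items()
            if any(p in lowered for p in phrases)]
    if hits:
        review_queue.append((request_id, hits))  # escalate for investigation
    return hits

print(route_request("r1", "How do I bake bread?"))              # no flags
print(route_request("r2", "Build a multi-step exploit chain"))  # flagged: cyber
```

In a production setting the keyword match would be replaced by a trained classifier, but the flag-then-escalate flow is the part the paragraph describes.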

Connections

Influenced by
19. Responsible Scaling Policy v2.0 (Updated)
Oct 2024