Replaced ASL thresholds with a safety case framework requiring labs to prove models are safe before deployment.
Major update introducing flexible capability thresholds, safety case methodologies, and new internal governance measures. Satisfied the White House Voluntary Commitments (2023) and the Frontier AI Safety Commitments (2024). Added external input mechanisms.
The revised capability-defined levels at which models require escalating safety measures and evaluation rigor. ASL-3 represents the threshold where models gain automated capabilities, multi-step planning ability, and potential for deception, triggering more stringent oversight. ASL-4 and beyond involve even more dangerous capabilities. The phrase "thresholds approaching" indicates that frontier models in 2024-2025 are nearing or meeting ASL-3 criteria, making those evaluations operationally critical. The update reflected concrete experience with where models actually sit on the scale.
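A minimal sketch of how capability-defined levels might map to escalating measures. The level names follow the policy, but the specific measures and the `REQUIRED_MEASURES` mapping are hypothetical illustrations, not the policy's actual requirements:

```python
from enum import IntEnum

class ASL(IntEnum):
    """AI Safety Levels: higher levels demand stricter safeguards."""
    ASL_2 = 2  # baseline for current frontier models
    ASL_3 = 3  # automated capabilities, multi-step planning, deception potential
    ASL_4 = 4  # even more dangerous capabilities; criteria still being defined

# Hypothetical mapping from level to escalating safety measures;
# the measure names are illustrative, not drawn from the policy text.
REQUIRED_MEASURES = {
    ASL.ASL_2: ["standard capability evals", "usage policies"],
    ASL.ASL_3: ["standard capability evals", "usage policies",
                "hardened security", "deployment safeguards", "expert red-teaming"],
    ASL.ASL_4: ["all ASL-3 measures", "external review", "measures to be defined"],
}

def measures_for(level: ASL) -> list[str]:
    """Return the measures a model assessed at `level` must satisfy."""
    return REQUIRED_MEASURES[level]
```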
A two-track evaluation system where capability assessments measure what the model can do (coding, planning, deception) and safeguard assessments measure whether protections keep those capabilities from producing harmful outcomes. A model might have high planning capability but strong safeguards against misuse. By disaggregating capability and safeguard evaluations, RSP v2.0 allows for more nuanced deployment decisions: a highly capable but well-safeguarded model might deploy with restrictions; a less capable but poorly safeguarded model might face stronger constraints.
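To make the disaggregation concrete, here is a minimal sketch of how two independent scores could feed a graduated deployment decision. The score ranges, thresholds, and function names are assumptions for illustration, not the policy's actual methodology:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    capability: float  # 0.0-1.0: what the model can do (coding, planning, deception)
    safeguard: float   # 0.0-1.0: how well protections block harmful outcomes

def deployment_decision(r: EvalResult) -> str:
    """Weigh capability and safeguards separately, allowing graduated
    outcomes instead of a binary gate. Thresholds are placeholders."""
    if r.capability >= 0.8 and r.safeguard < 0.5:
        return "no deploy: highly capable and poorly safeguarded"
    if r.capability >= 0.8:
        return "deploy with restrictions: capable but well-safeguarded"
    if r.safeguard < 0.5:
        return "deploy under stronger constraints: less capable, weak safeguards"
    return "deploy: standard monitoring"

# Example: high planning capability offset by strong safeguards.
print(deployment_decision(EvalResult(capability=0.9, safeguard=0.85)))
```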
The introduction of external oversight mechanisms where safety cases and evaluations are examined by auditors independent of Anthropic's internal teams. Third-party auditing adds accountability and reduces incentives for internal teams to rationalize deployment decisions. However, the utility of auditing depends on auditors' expertise: few external parties can meaningfully evaluate the technical evidence in complex safety cases. Auditing works best as a check on process integrity rather than technical judgment.
A structured escalation system where capability breaches or safety concerns trigger proportional responses ranging from additional monitoring and restrictions, to deployment pauses, to rollback. Rather than binary deployment/no-deployment decisions, graduated responses allow for fine-grained calibration. If a model shows unexpected behavior, interventions start light (more monitoring) and escalate only if problems persist. This reduces both false positives (over-constraining safe systems) and false negatives (missing real harms) by allowing iterative adjustment.
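A minimal sketch of such an escalation ladder, assuming hypothetical tier names and an `escalate` interface that the policy itself does not specify:

```python
from enum import IntEnum

class Response(IntEnum):
    """Graduated responses, ordered from lightest to heaviest intervention."""
    MONITOR = 1   # additional monitoring
    RESTRICT = 2  # usage restrictions
    PAUSE = 3     # deployment pause
    ROLLBACK = 4  # withdraw the model

def escalate(current: Response, problem_persists: bool) -> Response:
    """Step up one tier only while problems persist; otherwise hold steady.
    Holding avoids over-constraining safe systems (false positives), while
    continued escalation catches persistent harms (false negatives)."""
    if not problem_persists:
        return current
    return Response(min(current + 1, Response.ROLLBACK))

# Example: a model shows unexpected behavior; intervention starts light.
tier = Response.MONITOR
tier = escalate(tier, problem_persists=True)   # -> Response.RESTRICT
tier = escalate(tier, problem_persists=False)  # holds at RESTRICT
```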