Introduced AI Safety Levels (ASL-1 through ASL-4) with mandatory capability evaluations before scaling up.
Introduced the AI Safety Levels (ASL) framework, modeled after biosafety levels. Defined commitments not to train or deploy models without adequate safety measures in place. Classified Claude 2 as ASL-2. Became a model for responsible scaling commitments across the industry.
A tiered classification system for AI systems based on capability and risk. ASL-1 covers systems posing no meaningful catastrophic risk. ASL-2 and ASL-3 represent increasing capability and require correspondingly stronger safety evaluations (red teaming, dangerous-capability assessments). ASL-4 represents potentially transformative AI and triggers major additional safeguards. The framework borrows from biosafety levels, giving an intuitive scale that extends to future, more powerful systems.
Concrete, falsifiable pledges that link capability milestones to safety requirements. For example: "If a model reaches ASL-3 capability levels, then we will conduct X red-teaming tests before deployment." These commitments create accountability: outside observers can check whether Anthropic follows them.
Systematic assessments of what a model can do, used to determine its ASL classification. Evaluations measure not just benchmark performance but specific risky capabilities (e.g., ability to assist with bioweapons development, autonomous replication, strategic deception). A model's ASL depends on what it can actually accomplish, not on raw scale.
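The capability-gated logic described above (evaluations determine an ASL tier, and the if-then commitments block deployment until safeguards match that tier) can be sketched as a toy decision procedure. All names, scores, and thresholds below are hypothetical illustrations, not Anthropic's actual evaluation suite or criteria:

```python
# Toy sketch of capability-gated scaling. Thresholds and evaluation
# categories are invented for illustration only.
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Scores from hypothetical dangerous-capability evaluations (0.0-1.0)."""
    bio_uplift: float   # uplift on bioweapons-relevant tasks
    autonomy: float     # autonomous replication / long-horizon tasks
    deception: float    # strategic deception in red-team probes


def classify_asl(result: EvalResult) -> int:
    """Map evaluation scores to an illustrative ASL tier.

    Higher ASL means more dangerous capability, which in turn
    requires stricter safeguards before training or deployment.
    """
    worst = max(result.bio_uplift, result.autonomy, result.deception)
    if worst >= 0.8:
        return 4  # would trigger major additional safeguards
    if worst >= 0.5:
        return 3  # requires red teaming and deployment restrictions
    if worst >= 0.2:
        return 2  # baseline safety measures (e.g., Claude 2's tier)
    return 1      # minimal risk


def may_deploy(result: EvalResult, safeguards_ready_for_asl: int) -> bool:
    """The if-then commitment: deploy only if safeguards meet the ASL."""
    return safeguards_ready_for_asl >= classify_asl(result)
```

The key design point the sketch captures is that classification keys off the *worst* measured capability, so a model cannot be cleared for deployment by averaging a dangerous skill against benign ones.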
The RSP is a voluntary commitment with no external enforcement mechanism: Anthropic evaluates its own models, sets its own ASL classifications, and decides what safeguards are adequate. The policy creates social pressure on competitors and shapes regulatory expectations, but it relies on Anthropic's internal commitment rather than external constraint.