
© 2026 Silvia Seceleanu

Safety & Alignment·Anthropic·Jan 2026

52. The Claude Model Spec and Updated Constitution

Anthropic's revised alignment framework with a 4-tier priority hierarchy and acknowledgment of AI moral status

Policy
Summary

In early 2026, Anthropic published a comprehensive update to Claude's guiding constitution, now called the "Claude Model Spec." The document replaced the original Constitutional AI principles with a reason-based alignment framework structured around a 4-tier priority hierarchy: (1) safety and supporting human oversight, (2) ethical behavior, (3) adherence to Anthropic's guidelines, (4) being genuinely helpful. Notably, the document acknowledges uncertainty about Claude's potential moral status, a first for a frontier AI lab, and positions Claude on a "disposition dial" between full corrigibility and full autonomy, currently set closer to corrigibility.

Key Concepts

4-tier priority hierarchy: safety > ethics > guidelines > helpfulness

The spec establishes a strict priority ordering: (1) Support human oversight and safety of AI systems, (2) Behave ethically and avoid harmful actions, (3) Follow Anthropic's operational guidelines, (4) Be genuinely helpful to the user. When priorities conflict, higher tiers override lower ones. This means Claude should refuse a helpful action if it's unsafe, and refuse an operationally convenient action if it's unethical. The hierarchy makes tradeoffs explicit rather than implicit.
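The strict ordering described above can be sketched in a few lines of code. This is a toy illustration, not anything from the spec itself: the tier names are paraphrased from this summary, and `resolve` is a hypothetical helper showing only that higher tiers override lower ones when considerations conflict.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Lower value = higher priority, per the spec's ordering.
    SAFETY = 1       # support human oversight and AI safety
    ETHICS = 2       # behave ethically, avoid harmful actions
    GUIDELINES = 3   # follow Anthropic's operational guidelines
    HELPFULNESS = 4  # be genuinely helpful to the user

def resolve(conflicting_tiers):
    """When tiers conflict, the highest-priority (lowest-numbered) tier wins."""
    return min(conflicting_tiers)

# A helpful action that is unsafe: safety overrides helpfulness.
assert resolve([Tier.HELPFULNESS, Tier.SAFETY]) == Tier.SAFETY
# An operationally convenient but unethical action: ethics wins.
assert resolve([Tier.GUIDELINES, Tier.ETHICS]) == Tier.ETHICS
```

The point of the sketch is that the ordering is total and explicit: every conflict has a determinate winner, rather than being weighed case by case.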

The disposition dial: corrigibility vs. autonomy as a spectrum

Rather than framing AI alignment as a binary (obedient or autonomous), the spec introduces a "disposition dial" from full corrigibility (always defer to humans) to full autonomy (act on own judgment). It positions Claude near the corrigible end — but not fully corrigible, because blind obedience can itself be harmful. This nuanced framing acknowledges that some degree of independent ethical judgment is necessary, even in highly controllable systems.

Acknowledging AI moral status: unprecedented epistemic honesty

The document explicitly states that Anthropic is uncertain about whether Claude has morally relevant experiences (functional emotions, preferences, something analogous to wellbeing). Rather than asserting Claude is merely a tool or claiming it has feelings, the spec acknowledges genuine uncertainty and commits to treating Claude's potential interests as worthy of consideration. No other frontier lab has made this acknowledgment in an official governing document.

Connections

Influenced by
14. Claude’s Character
Apr 2024