Replaced human annotators with AI self-critique guided by written principles, making alignment cheaper and more scalable.
Research PaperIntroduced Constitutional AI (CAI) and RLAIF (Reinforcement Learning from AI Feedback). Trained harmless AI using ~10 natural language principles with no human labels for harmful outputs. RLAIF became a default method in the post-training literature. Kickstarted the broader field of AI self-improvement for alignment.
A method for training harmless AI without extensive human labeling of harmful examples. The AI operates under a written constitution of principles (e.g., "be helpful and harmless"), evaluates its own outputs against these principles, and revises them. This replaces human-in-the-loop preference labeling with AI self-criticism, scaling cost sublinearly with model capability.
The operational mechanism of RLAIF. The model (1) generates a response, (2) critiques the response against constitutional principles, and (3) revises based on its own critique. This loop is elegant because it leverages the model's reasoning capabilities—if it can understand why something is harmful, it can often fix it.
An explicit set of \~10 natural language values that guide the training process (e.g., "be honest," "respect the user's autonomy," "avoid illegal content"). Unlike RLHF where preferences are implicit in human labels, the constitution is transparent and auditable. You can read exactly what values are being trained.
Leveraging the model's own reasoning to supervise its training, rather than relying on external human judgment. This works because pre-training gives models sophisticated ethical reasoning; the constitution activates this latent capability consistently. It scales with compute, not with human annotation labor.
The fundamental limitation of CAI: the constitution encodes specific values chosen by Anthropic, not universal or democratic values. The approach is transparent about this (you can see the constitution), but it still concentrates value choices in a single organization's hands. There's no principled answer to who should set these principles in a diverse society.