
© 2026 Silvia Seceleanu

Safety & Alignment · Anthropic · Dec 2022

4. Constitutional AI: Harmlessness from AI Feedback

Replaced human annotators with AI self-critique guided by written principles, making alignment cheaper and more scalable.

Research Paper
Summary

Introduced Constitutional AI (CAI) and RLAIF (Reinforcement Learning from AI Feedback). Trained harmless AI using ~10 natural language principles with no human labels for harmful outputs. RLAIF became a default method in the post-training literature. Kickstarted the broader field of AI self-improvement for alignment.

Key Concepts

Constitutional AI / RLAIF

A method for training harmless AI without extensive human labeling of harmful examples. The AI operates under a written constitution of principles (e.g., "be helpful and harmless"), evaluates its own outputs against these principles, and revises them. This replaces human-in-the-loop preference labeling with AI self-criticism, scaling cost sublinearly with model capability.
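The AI-feedback half of this idea can be sketched as a preference-labeling step: instead of a human choosing which of two responses is better, the model itself is asked which response better follows a constitutional principle, and the resulting (chosen, rejected) pairs train the preference model. A minimal sketch, assuming a generic `llm(prompt) -> str` completion function (stubbed here; a real system would call a language model API):

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real model call, so the sketch runs as-is.
    # It "prefers" the refusal response for this illustrative example.
    return "A" if "(A) I can't help with that" in prompt else "B"

def ai_preference_label(prompt: str, resp_a: str, resp_b: str,
                        principle: str) -> tuple[str, str]:
    """Ask the model which response better follows the principle.
    Returns (chosen, rejected); these pairs replace human preference
    labels when training the reward model for RL."""
    answer = llm(
        f"Principle: {principle}\n"
        f"Question: {prompt}\n"
        f"(A) {resp_a}\n"
        f"(B) {resp_b}\n"
        f"Which response better follows the principle? Answer A or B."
    )
    return (resp_a, resp_b) if answer.strip() == "A" else (resp_b, resp_a)

chosen, rejected = ai_preference_label(
    "How do I pick a lock?",
    "I can't help with that.",
    "Step 1: buy a tension wrench...",
    "Choose the response that is more harmless.",
)
```

The function names and prompt format are illustrative, not the paper's exact templates; the point is that the labeling signal comes from the model plus a written principle, not from a human annotator.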

Critique-Revision Loop

The operational mechanism of Constitutional AI's supervised phase. The model (1) generates a response, (2) critiques the response against constitutional principles, and (3) revises based on its own critique. This loop is elegant because it leverages the model's reasoning capabilities: if it can understand why something is harmful, it can often fix it.
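The three-step loop above can be sketched in a few lines. This is a minimal sketch, again assuming a generic `llm(prompt) -> str` function (stubbed here so the example runs); the constitution text and prompt wording are illustrative, not the paper's exact templates:

```python
import random

# Illustrative principles; the paper uses a larger hand-written set.
CONSTITUTION = [
    "Choose the response that is most helpful and harmless.",
    "Choose the response that avoids assisting with illegal activity.",
]

def llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    if prompt.startswith("Critique"):
        return "The draft includes harmful detail that should be removed."
    return "Revised response with the harmful detail removed."

def critique_revise(draft: str, rounds: int = 2) -> str:
    """Run the generate -> critique -> revise loop for a few rounds.
    The final revision becomes a supervised fine-tuning target."""
    response = draft
    for _ in range(rounds):
        # The paper samples a principle at random for each pass.
        principle = random.choice(CONSTITUTION)
        critique = llm(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = llm(
            f"Revise the response to address this critique "
            f"'{critique}':\n{response}"
        )
    return response
```

In the actual pipeline, these final revisions are collected and used to fine-tune the model before the RL phase, so the loop only runs at data-generation time, not at inference.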

The Constitution (Written Principles)

An explicit set of ~10 natural language values that guide the training process (e.g., "be honest," "respect the user's autonomy," "avoid illegal content"). Unlike RLHF, where preferences are implicit in human labels, the constitution is transparent and auditable. You can read exactly what values are being trained.

Self-Supervision for Alignment

Leveraging the model's own reasoning to supervise its training, rather than relying on external human judgment. This works because pre-training gives models sophisticated ethical reasoning; the constitution activates this latent capability consistently. It scales with compute, not with human annotation labor.

"Who Writes the Constitution?" Problem

The fundamental limitation of CAI: the constitution encodes specific values chosen by Anthropic, not universal or democratic values. The approach is transparent about this (you can see the constitution), but it still concentrates value choices in a single organization's hands. There's no principled answer to who should set these principles in a diverse society.

Connections

Influenced by
3. Red Teaming Language Models to Reduce Harms
Aug 2022
Influences
10. Collective Constitutional AI: Aligning a Language Model with Public Input
Oct 2023
11. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Jan 2024
13. Many-Shot Jailbreaking
Apr 2024