
© 2026 Silvia Seceleanu

Safety & Alignment · Anthropic · Dec 2022

4. Constitutional AI: Harmlessness from AI Feedback

Replaced human annotators with AI self-critique guided by written principles, making alignment cheaper and more scalable.

Research Paper
Summary

Introduced Constitutional AI (CAI) and RLAIF (Reinforcement Learning from AI Feedback). Trained harmless AI using ~10 natural language principles with no human labels for harmful outputs. RLAIF became a default method in the post-training literature. Kickstarted the broader field of AI self-improvement for alignment.

Key Concepts

Constitutional AI / RLAIF

A method for training harmless AI without extensive human labeling of harmful examples. The AI operates under a written constitution of principles (e.g., "be helpful and harmless"), evaluates its own outputs against these principles, and revises them. This replaces human-in-the-loop preference labeling with AI self-criticism, scaling cost sublinearly with model capability.
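The AI-feedback half of this idea can be sketched as a preference-labeling step: instead of a human choosing which of two responses is better, the model itself is asked which response better follows a constitutional principle, and the resulting (chosen, rejected) pairs train the preference model. A minimal sketch, assuming a generic `llm(prompt) -> str` completion function (stubbed here; a real system would call a language model API):

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real model call, so the sketch runs as-is.
    # It "prefers" the refusal response for this illustrative example.
    return "A" if "(A) I can't help with that" in prompt else "B"

def ai_preference_label(prompt: str, resp_a: str, resp_b: str,
                        principle: str) -> tuple[str, str]:
    """Ask the model which response better follows the principle.
    Returns (chosen, rejected); these pairs replace human preference
    labels when training the reward model for RL."""
    answer = llm(
        f"Principle: {principle}\n"
        f"Question: {prompt}\n"
        f"(A) {resp_a}\n"
        f"(B) {resp_b}\n"
        f"Which response better follows the principle? Answer A or B."
    )
    return (resp_a, resp_b) if answer.strip() == "A" else (resp_b, resp_a)

chosen, rejected = ai_preference_label(
    "How do I pick a lock?",
    "I can't help with that.",
    "Step 1: buy a tension wrench...",
    "Choose the response that is more harmless.",
)
```

The function names and prompt format are illustrative, not the paper's exact templates; the point is that the labeling signal comes from the model plus a written principle, not from a human annotator.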

Critique-Revision Loop

The operational mechanism of Constitutional AI's supervised phase. The model (1) generates a response, (2) critiques the response against constitutional principles, and (3) revises based on its own critique. This loop is elegant because it leverages the model's reasoning capabilities: if it can understand why something is harmful, it can often fix it.
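The three-step loop above can be sketched in a few lines. This is a minimal sketch, again assuming a generic `llm(prompt) -> str` function (stubbed here so the example runs); the constitution text and prompt wording are illustrative, not the paper's exact templates:

```python
import random

# Illustrative principles; the paper uses a larger hand-written set.
CONSTITUTION = [
    "Choose the response that is most helpful and harmless.",
    "Choose the response that avoids assisting with illegal activity.",
]

def llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    if prompt.startswith("Critique"):
        return "The draft includes harmful detail that should be removed."
    return "Revised response with the harmful detail removed."

def critique_revise(draft: str, rounds: int = 2) -> str:
    """Run the generate -> critique -> revise loop for a few rounds.
    The final revision becomes a supervised fine-tuning target."""
    response = draft
    for _ in range(rounds):
        # The paper samples a principle at random for each pass.
        principle = random.choice(CONSTITUTION)
        critique = llm(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = llm(
            f"Revise the response to address this critique "
            f"'{critique}':\n{response}"
        )
    return response
```

In the actual pipeline, these final revisions are collected and used to fine-tune the model before the RL phase, so the loop only runs at data-generation time, not at inference.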

The Constitution (Written Principles)

An explicit set of ~10 natural language values that guide the training process (e.g., "be honest," "respect the user's autonomy," "avoid illegal content"). Unlike RLHF, where preferences are implicit in human labels, the constitution is transparent and auditable. You can read exactly what values are being trained.

Self-Supervision for Alignment

Leveraging the model's own reasoning to supervise its training, rather than relying on external human judgment. This works because pre-training gives models sophisticated ethical reasoning; the constitution activates this latent capability consistently. It scales with compute, not with human annotation labor.

"Who Writes the Constitution?" Problem

The fundamental limitation of CAI: the constitution encodes specific values chosen by Anthropic, not universal or democratic values. The approach is transparent about this (you can see the constitution), but it still concentrates value choices in a single organization's hands. There's no principled answer to who should set these principles in a diverse society.

Connections

Influenced by
3. Red Teaming Language Models to Reduce Harms
Aug 2022
Influences
10. Collective Constitutional AI: Aligning a Language Model with Public Input
Oct 2023
11. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Jan 2024
13. Many-Shot Jailbreaking
Apr 2024