Showed that RLHF scales most favorably with model size and that aligned models can outperform unaligned ones.
Research Paper. Anthropic's first alignment paper. Compared scaling trends for alignment via prompting, imitation learning, and preference modeling. Established the foundational methodology for training helpful, harmless, and honest AI assistants.
Helpful: Makes a clear attempt to answer; asks follow-up questions when needed; redirects ill-informed requests. Harmless: Refuses dangerous requests; recognizes disguised attempts; flags sensitive/consequential advice. Honest: Gives accurate information; is calibrated on uncertainty; transparent about its own capabilities and limitations. The three axes inherently conflict — a maximally helpful model may be harmful, a maximally harmless one becomes useless. Managing this tension is the core challenge that drives all subsequent Anthropic work.
Prompting barely works on small models, helps on large ones — behavior is a runtime patch, not a learned change. SFT (imitation learning) improves steadily with scale but is capped by demonstrator quality. RLHF (ranked preference modeling) shows the steepest improvement curve — it becomes more effective as models grow, because larger models can represent nuanced preferences better and generalize them more reliably. This result told the field RLHF would become more important, not less, as models scaled up. Safety needs to scale at least as fast as capability.
The performance cost of making a model aligned. If alignment degrades capability, labs won't invest in it. This paper showed that for sufficiently large models, RLHF actually improves performance on standard NLP benchmarks: a negative alignment tax (small models pay a modest tax, but it shrinks and reverses with scale). Aligned models score better because RLHF training teaches them to follow instructions more precisely, which transfers to general tasks. This gave every lab economic permission to invest in safety without fearing capability loss.
Three training objectives were tested: Imitation learning: mimic good examples. Token-level labels, moderate scaling. Binary discrimination: "A is better than B." Collapses preference to 1 bit — scales no better than imitation despite using human feedback. Ranked preferences: full ordering A > B > C > D. Produces k(k-1)/2 comparison pairs from k responses (4 responses = 6 pairs). Dramatically more information-dense — and scales most favorably. This is why RLHF uses rankings, not binary labels.
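The pair expansion and the pairwise objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function here takes placeholder floats where the real preference model outputs a learned reward, but the pair counting and the Bradley-Terry style logistic loss are the standard formulation.

```python
import itertools
import math

def ranking_to_pairs(ranked_responses):
    """Expand a full ordering (best first) into all k*(k-1)/2
    (preferred, dispreferred) training pairs."""
    return list(itertools.combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_dispreferred):
    """Bradley-Terry / logistic loss: -log sigmoid(r_w - r_l).
    Small when the preferred response scores higher, large otherwise."""
    margin = score_preferred - score_dispreferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A ranking A > B > C > D yields 6 comparison pairs:
pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 6

# The loss rewards scoring the preferred response higher:
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

A binary label carries one bit per comparison; a ranking of 4 responses yields 6 consistent comparisons from one labeling pass, which is the information-density advantage the paper observed.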
Human preference labels are scarce and expensive. PMP uses two stages: (1) pre-train the reward model on large, cheap proxy data — Stack Exchange upvotes, Reddit scores, Wikipedia edits — to learn general text quality features like coherence and relevance; (2) fine-tune on the small set of actual human preference labels to specialize for safety, honesty, and nuance. Far more sample-efficient than training from scratch. This "bootstrap cheap, refine expensive" pattern recurs in Constitutional AI, RLAIF, and sparse autoencoders.
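The two-stage recipe can be sketched as follows. This is a toy sketch under loud assumptions: a hand-rolled linear scorer with logistic (Bradley-Terry) updates stands in for the paper's Transformer preference models, and both datasets are made-up placeholders for the real proxy data (e.g., upvoted-vs-downvoted Stack Exchange answers) and human labels.

```python
import math
import random

def featurize(text):
    """Toy features standing in for a learned representation."""
    return [1.0, len(text) / 100.0, text.count(".") / 10.0]

def score(w, text):
    """Linear reward: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, featurize(text)))

def train(w, pairs, lr=0.1, epochs=20):
    """Gradient steps on -log sigmoid(score(better) - score(worse)),
    pushing the preferred response's score above the dispreferred one."""
    for _ in range(epochs):
        random.shuffle(pairs)
        for better, worse in pairs:
            margin = score(w, better) - score(w, worse)
            g = 1.0 / (1.0 + math.exp(margin))  # 1 - sigmoid(margin)
            xb, xw = featurize(better), featurize(worse)
            for i in range(len(w)):
                w[i] += lr * g * (xb[i] - xw[i])
    return w

w = [0.0, 0.0, 0.0]
# Stage 1: preference-model pre-training on large, cheap proxy
# comparisons (hypothetical stand-ins for upvote-based pairs).
proxy_pairs = [("A detailed answer. With sources.", "meh")] * 50
w = train(w, proxy_pairs)
# Stage 2: fine-tune the same weights on the scarce human labels.
human_pairs = [("I am not sure; here is what I know.", "Definitely! (wrong)")]
w = train(w, human_pairs, lr=0.01, epochs=5)
```

The point of the structure is that stage 1 and stage 2 share one model and one objective; only the data changes, so the expensive human labels are spent refining a model that already knows generic text quality.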