Showed that RLHF scales most favorably with model size and that aligned models can outperform unaligned ones.
Research Paper. Anthropic's first alignment paper. Compared scaling trends for alignment via prompting, imitation learning, and preference modeling. Established the foundational methodology for training helpful, harmless, and honest AI assistants.
Helpful: Makes a clear attempt to answer; asks follow-up questions when needed; redirects ill-informed requests. Harmless: Refuses dangerous requests; recognizes disguised attempts; flags sensitive/consequential advice. Honest: Gives accurate information; is calibrated on uncertainty; transparent about its own capabilities and limitations. The three axes inherently conflict — a maximally helpful model may be harmful, a maximally harmless one becomes useless. Managing this tension is the core challenge that drives all subsequent Anthropic work.
Prompting barely works on small models, helps on large ones — behavior is a runtime patch, not a learned change. SFT (imitation learning) improves steadily with scale but is capped by demonstrator quality. RLHF (ranked preference modeling) shows the steepest improvement curve — it becomes more effective as models grow, because larger models can represent nuanced preferences better and generalize them more reliably. This result told the field RLHF would become more important, not less, as models scaled up. Safety needs to scale at least as fast as capability.
The performance cost of making a model aligned. If alignment degrades capability, labs won't invest in it. This paper showed that for sufficiently large models, RLHF actually improves performance on standard NLP benchmarks: a negative alignment tax (small models pay a modest tax, but it shrinks and reverses with scale). Aligned models score better because RLHF training teaches them to follow instructions more precisely, which transfers to general tasks. This gave every lab economic permission to invest in safety without fearing capability loss.
Three training objectives were tested: Imitation learning: mimic good examples. Token-level labels, moderate scaling. Binary discrimination: "A is better than B." Collapses preference to 1 bit — scales no better than imitation despite using human feedback. Ranked preferences: full ordering A > B > C > D. Produces k(k-1)/2 comparison pairs from k responses (4 responses = 6 pairs). Dramatically more information-dense — and scales most favorably. This is why RLHF uses rankings, not binary labels.
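The pair expansion and the pairwise objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function here takes placeholder floats where the real preference model outputs a learned reward, but the pair counting and the Bradley-Terry style logistic loss are the standard formulation.

```python
import itertools
import math

def ranking_to_pairs(ranked_responses):
    """Expand a full ordering (best first) into all k*(k-1)/2
    (preferred, dispreferred) training pairs."""
    return list(itertools.combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_dispreferred):
    """Bradley-Terry / logistic loss: -log sigmoid(r_w - r_l).
    Small when the preferred response scores higher, large otherwise."""
    margin = score_preferred - score_dispreferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A ranking A > B > C > D yields 6 comparison pairs:
pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 6

# The loss rewards scoring the preferred response higher:
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

A binary label carries one bit per comparison; a ranking of 4 responses yields 6 consistent comparisons from one labeling pass, which is the information-density advantage the paper observed.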
Human preference labels are scarce and expensive. PMP uses two stages: (1) pre-train the reward model on large, cheap proxy data — Stack Exchange upvotes, Reddit scores, Wikipedia edits — to learn general text quality features like coherence and relevance; (2) fine-tune on the small set of actual human preference labels to specialize for safety, honesty, and nuance. Far more sample-efficient than training from scratch. This "bootstrap cheap, refine expensive" pattern recurs in Constitutional AI, RLAIF, and sparse autoencoders.
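The two-stage recipe can be sketched as follows. This is a toy sketch under loud assumptions: a hand-rolled linear scorer with logistic (Bradley-Terry) updates stands in for the paper's Transformer preference models, and both datasets are made-up placeholders for the real proxy data (e.g., upvoted-vs-downvoted Stack Exchange answers) and human labels.

```python
import math
import random

def featurize(text):
    """Toy features standing in for a learned representation."""
    return [1.0, len(text) / 100.0, text.count(".") / 10.0]

def score(w, text):
    """Linear reward: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, featurize(text)))

def train(w, pairs, lr=0.1, epochs=20):
    """Gradient steps on -log sigmoid(score(better) - score(worse)),
    pushing the preferred response's score above the dispreferred one."""
    for _ in range(epochs):
        random.shuffle(pairs)
        for better, worse in pairs:
            margin = score(w, better) - score(w, worse)
            g = 1.0 / (1.0 + math.exp(margin))  # 1 - sigmoid(margin)
            xb, xw = featurize(better), featurize(worse)
            for i in range(len(w)):
                w[i] += lr * g * (xb[i] - xw[i])
    return w

w = [0.0, 0.0, 0.0]
# Stage 1: preference-model pre-training on large, cheap proxy
# comparisons (hypothetical stand-ins for upvote-based pairs).
proxy_pairs = [("A detailed answer. With sources.", "meh")] * 50
w = train(w, proxy_pairs)
# Stage 2: fine-tune the same weights on the scarce human labels.
human_pairs = [("I am not sure; here is what I know.", "Definitely! (wrong)")]
w = train(w, human_pairs, lr=0.01, epochs=5)
```

The point of the structure is that stage 1 and stage 2 share one model and one objective; only the data changes, so the expensive human labels are spent refining a model that already knows generic text quality.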