Safety & Alignment · Anthropic · Apr 2022

2. Training a Helpful and Harmless Assistant with RLHF

Demonstrated that iterated online RLHF improves both alignment and capability, and released the HH-RLHF dataset publicly.

Research Paper
Summary

Anthropic's second alignment paper. It applied RLHF to finetune language models as helpful and harmless assistants, found that alignment training improves NLP benchmark performance, introduced iterated online training with weekly human-feedback updates, and released the HH-RLHF dataset publicly.

Key Concepts

RLHF Pipeline (Reward Model + PPO)

The core technical approach: first train a supervised reward model on human preference labels (which response is better), then use reinforcement learning (PPO algorithm) to optimize the language model's outputs to maximize the reward model's predictions. This creates a feedback loop where human judgments shape model behavior without requiring human annotation for every generated response.
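The two-stage pipeline above can be sketched with toy scalar scores standing in for real model outputs. The function names, the Bradley-Terry-style pairwise loss, and the KL coefficient value are illustrative assumptions, not code from the paper:

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss for training the reward model: -log sigmoid(r_chosen - r_rejected).
    Small when the model scores the human-preferred response higher, large otherwise."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def ppo_reward(rm_score: float, logp_policy: float, logp_ref: float,
               kl_coef: float = 0.1) -> float:
    """Per-sample reward maximized in the RL stage: the reward model's score
    minus a KL penalty that keeps the policy close to its initial model."""
    return rm_score - kl_coef * (logp_policy - logp_ref)

# A correctly ranked pair incurs a small loss; a mis-ranked pair a large one.
low_loss = preference_loss(2.0, -1.0)
high_loss = preference_loss(-1.0, 2.0)
```

The KL term is why the policy does not simply collapse onto whatever text the reward model happens to overrate: drifting far from the reference model is itself penalized.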

Helpful-Harmless Tension

A fundamental tradeoff discovered in the paper: training a model to be maximally harmless (refusing unsafe requests) tends to reduce helpfulness, because the model starts refusing legitimate requests too. The paper quantified this tension empirically but did not resolve it; decoupling the two objectives came later with Constitutional AI.

Iterated Online Training

Instead of a single round of offline RLHF, Anthropic collected human feedback continuously on a weekly schedule. This meant the reward model improved over time as annotators evaluated newly generated responses, allowing the training process itself to become more aligned with human preferences each iteration. Most competitors did one-shot RLHF; iteration proved significantly more effective.
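The weekly loop described above can be sketched as follows. All function names here are placeholders for the paper's data-collection, reward-model, and RL stages, not Anthropic's actual code:

```python
def iterated_online_rlhf(policy, reward_model, preference_data, weeks,
                         collect_comparisons, train_reward_model, run_rlhf):
    """Illustrative iterated online RLHF loop: each iteration folds fresh
    human comparisons into the preference dataset, retrains the reward model,
    and continues RL against the refreshed reward model."""
    for week in range(weeks):
        # 1. Crowdworkers compare responses sampled from the *current* policy,
        #    so the reward model is trained on the distribution it must score.
        preference_data += collect_comparisons(policy)
        # 2. Retrain the reward model on the growing dataset.
        reward_model = train_reward_model(preference_data)
        # 3. Resume RL optimization against the updated reward model.
        policy = run_rlhf(policy, reward_model)
    return policy
```

The key design point is step 1: in one-shot offline RLHF, the reward model only ever sees comparisons of an older model's outputs, so its judgments go stale as the policy improves; iteration keeps it on-distribution.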

HH-RLHF Dataset

A publicly released dataset of curated human preferences on helpful vs. harmless responses. The dataset became a standard fine-tuning resource that influenced alignment training across the industry, meaning Anthropic's specific value framework was encoded into models developed by competing organizations.
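Each HH-RLHF example is a pair of full dialogue transcripts, "chosen" and "rejected", that share the same prompt and diverge at the final assistant turn. The sample text below is invented for illustration, and the parsing helper is a hypothetical sketch of how such a record might be split into turns:

```python
# Invented sample record in the chosen/rejected transcript shape.
record = {
    "chosen": "\n\nHuman: How do I bake bread?"
              "\n\nAssistant: Start by mixing flour, water, yeast, and salt.",
    "rejected": "\n\nHuman: How do I bake bread?"
                "\n\nAssistant: I don't know.",
}

def split_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, text) turns on blank-line boundaries."""
    turns = []
    for chunk in transcript.split("\n\n"):
        if ": " in chunk:
            speaker, text = chunk.split(": ", 1)
            turns.append((speaker, text))
    return turns
```

Because the two transcripts differ only in the final response, reward-model training can treat the shared prefix as the prompt and score the divergent completions against each other.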

Alignment Tax (Negative — Alignment Improves Performance)

Contrary to the expectation that safety training would degrade performance, the paper found that RLHF-trained models actually improved on standard NLP benchmarks compared to base models. This was crucial for adoption because it removed the primary argument against safety training: it didn't cost performance.

Connections

Influenced by
1. A General Language Assistant as a Laboratory for Alignment
Dec 2021
Influences
3. Red Teaming Language Models to Reduce Harms
Aug 2022