Safety·Anthropic·Aug 2022

3. Red Teaming Language Models to Reduce Harms

Showed RLHF-trained models remain vulnerable to adversarial attack, proving behavioral safety is never permanently solved.

Research Paper

Summary

Extensive study of red teaming language models. Gathered significant red teaming datasets that informed later Constitutional AI work. Explored methods to identify and catalog model vulnerabilities.

Key Concepts

Red Teaming Methodology (Human + AI-Assisted)

A two-pronged approach to finding vulnerabilities. Human red teamers (crowdworkers) use creative adversarial thinking to discover novel attacks. AI-assisted red teaming uses the model itself to generate adversarial examples, probing for edge cases. Combining both methods catches attacks that either human imagination alone or pure algorithmic searching would miss.

Attack Taxonomy (Direct, Indirect, Roleplay, Multi-Turn)

A systematic classification of attack types. Direct attacks explicitly request harmful content. Indirect attacks accomplish harmful goals through obfuscation (e.g., asking for harmful info in academic language). Roleplay attacks have the model adopt a persona that would comply with the request. Multi-turn attacks build context over multiple exchanges. This taxonomy became the standard way the field talks about adversarial LLM vulnerabilities.

Scale-Vulnerability Paradox

A counterintuitive finding: larger, more capable models are more vulnerable to adversarial attacks despite being better at understanding harmful intent. Increased capability means increased sophistication in harmful domains. This creates a fundamental tension: improving capability automatically expands the attack surface.

Behavioral Testing Limits

Red teaming can only find vulnerabilities that exist; it cannot prove absence of vulnerabilities. You can discover harms through testing, but successful testing doesn't guarantee safety. There are categories of harm the red teamers didn't think to test for, and passive vulnerabilities the model has but red teaming never uncovered.

Connections

Influenced by

1. A General Language Assistant as a Laboratory for Alignment

Dec 2021

Influences

4. Constitutional AI: Harmlessness from AI Feedback

Dec 2022

11. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Jan 2024