SS
About Me
Frontier AI Paper BriefingsPersonal AI Telegram BotClinical Trial EnrollerLittle Human Names
DisclaimersPrivacy PolicyTerms of Use
Privacy Policy·Terms of Use·Disclaimers

© 2026 Silvia Seceleanu

← Back to Explorer
Safety·Anthropic·Aug 2022

3. Red Teaming Language Models to Reduce Harms

Showed RLHF-trained models remain vulnerable to adversarial attack, proving behavioral safety is never permanently solved.

Research Paper
Summary

Extensive study of red teaming language models. Gathered significant red teaming datasets that informed later Constitutional AI work. Explored methods to identify and catalog model vulnerabilities.

Key Concepts

Red Teaming Methodology (Human + AI-Assisted)

A two-pronged approach to finding vulnerabilities. Human red teamers (crowdworkers) use creative adversarial thinking to discover novel attacks. AI-assisted red teaming uses the model itself to generate adversarial examples, probing for edge cases. Combining both methods catches attacks that either human imagination alone or pure algorithmic searching would miss.

Attack Taxonomy (Direct, Indirect, Roleplay, Multi-Turn)

A systematic classification of attack types. Direct attacks explicitly request harmful content. Indirect attacks accomplish harmful goals through obfuscation (e.g., asking for harmful info in academic language). Roleplay attacks have the model adopt a persona that would comply with the request. Multi-turn attacks build context over multiple exchanges. This taxonomy became the standard way the field talks about adversarial LLM vulnerabilities.

Scale-Vulnerability Paradox

A counterintuitive finding: larger, more capable models are more vulnerable to adversarial attacks despite being better at understanding harmful intent. Increased capability means increased sophistication in harmful domains. This creates a fundamental tension: improving capability automatically expands the attack surface.

Behavioral Testing Limits

Red teaming can only find vulnerabilities that exist; it cannot prove absence of vulnerabilities. You can discover harms through testing, but successful testing doesn't guarantee safety. There are categories of harm the red teamers didn't think to test for, and passive vulnerabilities the model has but red teaming never uncovered.

Connections

3. Red Teaming Lang…Aug 20221. A General Langua…Dec 20214. Constitutional A…Dec 202211. Sleeper Agents: …Jan 2024Influenced byInfluences
Influenced by
1. A General Language Assistant as a Laboratory for Alignment
Dec 2021
Influences
4. Constitutional AI: Harmlessness from AI Feedback
Dec 2022
11. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Jan 2024