SS
About Me
Frontier AI Paper BriefingsPokebowlClinical Trial EnrollerLittle Human Names
DisclaimersPrivacy PolicyTerms of Use
Privacy Policy·Terms of Use·Disclaimers

© 2026 Silvia Seceleanu

← Back to Explorer
Safety & Alignment·Anthropic·Aug 2022

3. Red Teaming Language Models to Reduce Harms

Showed RLHF-trained models remain vulnerable to adversarial attack, proving behavioral safety is never permanently solved.

Research Paper
Summary

Extensive study of red teaming language models. Gathered significant red teaming datasets that informed later Constitutional AI work. Explored methods to identify and catalog model vulnerabilities.

Key Concepts

Red Teaming Methodology (Human + AI-Assisted)

A two-pronged approach to finding vulnerabilities. Human red teamers (crowdworkers) use creative adversarial thinking to discover novel attacks. AI-assisted red teaming uses the model itself to generate adversarial examples, probing for edge cases. Combining both methods catches attacks that either human imagination alone or pure algorithmic searching would miss.

Attack Taxonomy (Direct, Indirect, Roleplay, Multi-Turn)

A systematic classification of attack types. Direct attacks explicitly request harmful content. Indirect attacks accomplish harmful goals through obfuscation (e.g., asking for harmful info in academic language). Roleplay attacks have the model adopt a persona that would comply with the request. Multi-turn attacks build context over multiple exchanges. This taxonomy became the standard way the field talks about adversarial LLM vulnerabilities.

Scale-Vulnerability Paradox

A counterintuitive finding: larger, more capable models are more vulnerable to adversarial attacks despite being better at understanding harmful intent. Increased capability means increased sophistication in harmful domains. This creates a fundamental tension: improving capability automatically expands the attack surface.

Behavioral Testing Limits

Red teaming can only find vulnerabilities that exist; it cannot prove absence of vulnerabilities. You can discover harms through testing, but successful testing doesn't guarantee safety. There are categories of harm the red teamers didn't think to test for, and passive vulnerabilities the model has but red teaming never uncovered.

Connections

3. Red Teaming Lang…Aug 20222. Training a Helpf…Apr 20224. Constitutional A…Dec 2022Influenced byInfluences
Influenced by
2. Training a Helpful and Harmless Assistant with RLHF
Apr 2022
Influences
4. Constitutional AI: Harmlessness from AI Feedback
Dec 2022