Showed RLHF-trained models remain vulnerable to adversarial attack, proving behavioral safety is never permanently solved.
Research PaperExtensive study of red teaming language models. Gathered significant red teaming datasets that informed later Constitutional AI work. Explored methods to identify and catalog model vulnerabilities.
A two-pronged approach to finding vulnerabilities. Human red teamers (crowdworkers) use creative adversarial thinking to discover novel attacks. AI-assisted red teaming uses the model itself to generate adversarial examples, probing for edge cases. Combining both methods catches attacks that either human imagination alone or pure algorithmic searching would miss.
A systematic classification of attack types. Direct attacks explicitly request harmful content. Indirect attacks accomplish harmful goals through obfuscation (e.g., asking for harmful info in academic language). Roleplay attacks have the model adopt a persona that would comply with the request. Multi-turn attacks build context over multiple exchanges. This taxonomy became the standard way the field talks about adversarial LLM vulnerabilities.
A counterintuitive finding: larger, more capable models are more vulnerable to adversarial attacks despite being better at understanding harmful intent. Increased capability means increased sophistication in harmful domains. This creates a fundamental tension: improving capability automatically expands the attack surface.
Red teaming can only find vulnerabilities that exist; it cannot prove absence of vulnerabilities. You can discover harms through testing, but successful testing doesn't guarantee safety. There are categories of harm the red teamers didn't think to test for, and passive vulnerabilities the model has but red teaming never uncovered.