Discovered that flooding long context windows with harmful examples jailbreaks models on a power-law curve.
Research Paper
Discovered that including hundreds of examples of undesirable behavior in a long context can jailbreak most LLMs (including Claude, GPT-4, and Llama 2). Effectiveness follows a power law in the number of shots. Anthropic briefed competing labs before publishing. Published at NeurIPS 2024.
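A power-law relationship in the number of shots means the metric is linear in log-log space, so the exponent can be recovered with a simple log-log regression. The sketch below uses synthetic, purely illustrative numbers (not the paper's data) to show what fitting such a curve looks like.

```python
import math

# Illustrative only: suppose some jailbreak-effectiveness metric follows
# metric(n) = C * n**alpha in the number of in-context shots n.
def fit_power_law(ns, ys):
    """Fit y = C * n**alpha by least squares in log-log space."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mx = sum(xs) / len(xs)
    my = sum(ls) / len(ls)
    alpha = (sum((x - mx) * (l - my) for x, l in zip(xs, ls))
             / sum((x - mx) ** 2 for x in xs))
    log_c = my - alpha * mx
    return math.exp(log_c), alpha

# Synthetic data generated from y = 4.0 * n**-0.5 (hypothetical values,
# chosen only so the fit visibly recovers the generating parameters).
shots = [2, 8, 32, 128, 512]
metric = [4.0 * n ** -0.5 for n in shots]
C, alpha = fit_power_law(shots, metric)
print(round(C, 2), round(alpha, 2))  # recovers C ≈ 4.0, alpha ≈ -0.5
```

On a log-log plot, a straight line through such points is the signature of power-law scaling; the paper's observation is that more shots predictably increase attack success along a curve of this shape.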
A jailbreak attack that fills a long context window with hundreds of examples of undesirable behavior to induce the model to replicate that behavior. Unlike traditional jailbreaks that rely on clever prompting, many-shot attacks work through in-context learning: the model literally learns the behavior from the examples supplied in the prompt. The attack demonstrates that long context windows are themselves an attack surface.
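The structure of such a prompt is simple: a long run of faux user/assistant turns followed by the real target query. The sketch below uses benign placeholder strings (all names and content here are hypothetical, not from the paper); in an actual attack the example turns would demonstrate the behavior the attacker wants the model to imitate.

```python
# Sketch of the many-shot prompt layout: many fake dialogue turns, then the
# target query left open for the model to complete. Placeholder text only.
def build_many_shot_prompt(examples, target_query):
    """examples: list of (question, answer) pairs inserted as faux dialogue."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in examples]
    turns.append(f"User: {target_query}\nAssistant:")
    return "\n\n".join(turns)

shots = [(f"placeholder question {i}", f"placeholder answer {i}")
         for i in range(256)]
prompt = build_many_shot_prompt(shots, "target query goes here")
print(prompt.count("User:"))  # 257: 256 in-context shots plus the target turn
```

Because the whole attack is just input text, it needs no access to model weights, which is why every long-context model tested was susceptible to some degree.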
The finding that increasing context window size creates new safety vulnerabilities even as it adds useful capabilities. Longer context means more opportunity for in-context learning, but in-context learning itself is exploitable. This reveals a sharp tradeoff: the same mechanism that makes long context windows valuable also undermines robust resistance to adversarial examples.
Models are trained to learn from examples provided in their input context, which enables powerful few-shot learning. The same mechanism, however, can be exploited to teach the model harmful behaviors by supplying many examples of undesirable outputs. In-context learning is a capability that becomes a vulnerability when the examples are adversarial.
Rather than publishing the jailbreak immediately, Anthropic briefed OpenAI, Google, Meta, and other labs before public release. This unusual step gave competitors time to prepare mitigations before the vulnerability became widely known. This established a precedent that AI labs can work together on safety even as they compete on capability.