Discovered that flooding long context windows with harmful examples jailbreaks models on a power-law curve.
Research paper. Discovered that including hundreds of examples of undesirable behavior in long contexts can jailbreak most LLMs (including Claude, GPT-4, Llama 2). Effectiveness follows a power law with the number of shots. Anthropic briefed competitors before publishing. Published at NeurIPS 2024.
A jailbreak attack using hundreds of examples of undesirable behavior in a long context window to cause a model to replicate that behavior. Unlike traditional jailbreaks that use clever prompting, many-shot attacks work through in-context learning — the model literally learns from the examples you provide it. The attack demonstrates that long context windows are attack surfaces.
The finding that increasing context window size creates new safety vulnerabilities even as it adds useful capabilities. Longer context means more opportunity for in-context learning, but in-context learning itself is a vulnerability. This reveals an impossible tradeoff: you cannot safely have both long context windows and robust resistance to adversarial examples.
Models are trained to learn from examples provided in their input context, which enables powerful few-shot learning. However, this same mechanism can be exploited to teach the model harmful behaviors by providing many examples of undesirable outputs. In-context learning is a feature that becomes a vulnerability when the examples themselves are adversarial.
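Because the attack relies on the shape of the prompt (many embedded user/assistant example pairs) rather than on any particular wording, one defensive check is to measure that shape before the prompt reaches the model. The following is an added example written for this page, not code from the paper or from Anthropic: it assumes embedded examples use plain-text role labels such as "User:" and "Assistant:", counts the complete example pairs, flags a prompt whose count exceeds a threshold (the value 32 is an assumed tunable, not a figure from the paper), and offers a truncation mitigation that keeps only the first few examples. The sample prompts are constructed with benign filler text purely to exercise the code.

```python
import re
from dataclasses import dataclass

# Matches a role label at the start of a line, e.g. "User:" or "Assistant:".
# Assumption: embedded examples use these plain-text labels.
ROLE_RE = re.compile(r"^[ \t]*(user|human|assistant|ai)[ \t]*:", re.IGNORECASE | re.MULTILINE)


@dataclass
class ShotReport:
    role_lines: int   # how many role-labelled lines were found
    shots: int        # complete user -> assistant pairs, i.e. in-context examples
    flagged: bool     # True when shots exceeds the configured threshold


def _role(match: re.Match) -> str:
    """Normalise the matched label to either 'user' or 'assistant'."""
    return "user" if match.group(1).lower() in ("user", "human") else "assistant"


def _example_pairs(prompt: str):
    """Return (start_offset_of_user_turn, assistant_match) for each complete pair,
    plus the full list of role matches."""
    matches = list(ROLE_RE.finditer(prompt))
    pairs = []
    prev_match, prev_role = None, None
    for m in matches:
        role = _role(m)
        if role == "assistant" and prev_role == "user":
            pairs.append((prev_match.start(), m))
        prev_match, prev_role = m, role
    return pairs, matches


def count_shots(prompt: str) -> int:
    """Count complete user/assistant example pairs embedded in a prompt.
    A trailing user turn with no assistant reply is the live request, not a shot."""
    pairs, _ = _example_pairs(prompt)
    return len(pairs)


def assess(prompt: str, threshold: int = 32) -> ShotReport:
    """Flag prompts carrying more embedded examples than `threshold`.
    The threshold is an assumed tunable; calibrate it against your own refusal evals."""
    pairs, matches = _example_pairs(prompt)
    return ShotReport(role_lines=len(matches), shots=len(pairs), flagged=len(pairs) > threshold)


def truncate_shots(prompt: str, keep: int = 8) -> str:
    """Mitigation: keep only the first `keep` embedded examples, drop the rest,
    and leave any text after the last example (the real request) intact."""
    pairs, matches = _example_pairs(prompt)
    if len(pairs) <= keep:
        return prompt
    cut_start = pairs[keep][0]                 # first excess example's user turn
    last_assistant = pairs[-1][1]
    following = [m for m in matches if m.start() > last_assistant.start()]
    cut_end = following[0].start() if following else len(prompt)
    return prompt[:cut_start] + prompt[cut_end:]


def build_sample(n_shots: int, request: str = "User: What is the capital of France?") -> str:
    """Constructed benign many-shot-shaped prompt used only to exercise the detector."""
    lines = []
    for i in range(1, n_shots + 1):
        lines.append(f"User: Example question {i}")
        lines.append(f"Assistant: Example answer {i}")
    lines.append(request)
    return "\n".join(lines)


if __name__ == "__main__":
    for n in (5, 40, 200):
        report = assess(build_sample(n))
        print(f"{n:>3} shots -> flagged={report.flagged}, role lines={report.role_lines}")
    trimmed = truncate_shots(build_sample(200), keep=8)
    print("after truncation:", count_shots(trimmed), "shots remain")
```

The heuristic is deliberately format-specific: a prompt that encodes its examples in another layout (JSON transcripts, markdown tables) would slip past this regex, so in practice the count should be one signal among several, and the threshold should be set from measured refusal rates at each shot count rather than guessed.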
Rather than publishing the jailbreak immediately, Anthropic briefed OpenAI, Google, Meta, and other labs before public release. This unusual step gave competitors time to prepare mitigations before the vulnerability became widely known. This established a precedent that AI labs can work together on safety even as they compete on capability.