Extracted millions of interpretable features from Claude 3 Sonnet, including abstract concepts like deception and bias.
Research Paper: Scaled sparse autoencoders to Claude 3 Sonnet, finding tens of millions of interpretable features representing abstract concepts (cities, code patterns, deception, bias). Showed feature steering can modify model outputs in specific, interpretable ways. Landmark result proving interpretability scales to production models.
This paper scaled SAEs from toy one-layer models to Claude 3 Sonnet, a production model with a deep multi-layer architecture. Training SAEs at this scale required significant compute and engineering innovations, but it proved the technique generalizes across model sizes. SAEs now work on production models, not just research demonstrations.
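The core SAE recipe can be sketched compactly. This is a minimal, untrained illustration (the dimensions, weights, and L1 coefficient below are hypothetical, not the paper's actual values): encode a model activation into an overcomplete set of features through a ReLU, decode it back linearly, and train against reconstruction error plus an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: a residual-stream width
# and an overcomplete feature dictionary (real SAEs are far larger).
d_model, d_features = 64, 512

# SAE parameters, randomly initialized here; learned by training in practice.
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU yields sparse feature activations
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the activation
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(f))
    return recon + sparsity

x = rng.normal(size=d_model)       # stand-in for a captured model activation
f, x_hat = sae_forward(x)
```

Scaling this from a one-layer toy model to a production model changes the engineering (data volume, dictionary size, compute), but not this basic objective.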
Beyond finding concrete features like "mentions of Paris," the SAEs discovered abstract behavioral features including deception, bias, sycophancy, and other high-level concepts. These abstract features are directly relevant to AI safety and alignment. Finding that abstract, safety-relevant behaviors are interpretable as features opens possibilities for monitoring and control.
Activation clamping: pinning a feature's activation to a chosen value to directly influence model behavior. By clamping deception-related features to high values, researchers made Claude behave more deceptively; by clamping them to low values, they reduced deceptive outputs. This demonstrates that features exert causal control over behavior, rather than merely correlating with it.
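Mechanically, clamping amounts to swapping out one feature's decoder contribution in the activation being patched. A minimal sketch, assuming hypothetical SAE decoder weights and feature activations (the index 7 and clamp values below are placeholders, not real deception features):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 512

# Hypothetical SAE decoder and a sample sparse feature vector (untrained).
W_dec = rng.normal(0, 0.1, (d_features, d_model))
f = np.maximum(0.0, rng.normal(size=d_features))  # sparse-ish feature activations
x = f @ W_dec                                     # the activation they reconstruct

def clamp_feature(x, f, feature_idx, clamp_value):
    """Pin one feature to a chosen value and patch the activation.

    Subtracts the feature's current decoder contribution and adds the
    clamped one, leaving every other feature's contribution untouched.
    """
    delta = (clamp_value - f[feature_idx]) * W_dec[feature_idx]
    return x + delta

# Amplify a (hypothetical) feature at index 7 to steer behavior toward it...
x_up = clamp_feature(x, f, feature_idx=7, clamp_value=5.0)
# ...or clamp it to zero to suppress the behavior it represents.
x_off = clamp_feature(x, f, feature_idx=7, clamp_value=0.0)
```

The patched activation is then fed back into the model's forward pass in place of the original, which is what makes the intervention causal rather than observational.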
Features that directly impact whether a model behaves safely or unsafely, such as deception, bias, and sycophancy. The discovery that these features are both interpretable and steerable opens the possibility of real-time safety monitoring: detecting when a model is activating deception-related features during inference and intervening before the output is produced.
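A monitoring loop of this kind could be as simple as thresholding a watchlist of feature activations at inference time. A sketch under stated assumptions: the feature indices and threshold below are entirely hypothetical, standing in for safety-relevant features an SAE might surface.

```python
import numpy as np

# Hypothetical indices of safety-relevant SAE features (placeholders only).
DECEPTION_FEATURES = [12, 340, 517]

def flag_unsafe(f, threshold=4.0):
    """Return the monitored feature indices whose activation exceeds threshold.

    In a real deployment, f would be the SAE feature activations computed
    from the model's residual stream at each forward pass.
    """
    return [i for i in DECEPTION_FEATURES if f[i] > threshold]

f = np.zeros(1024)
f[340] = 6.2              # simulate one deception feature firing strongly
alerts = flag_unsafe(f)   # -> [340]
```

A production monitor would need calibrated thresholds and a policy for what "taking action" means (logging, refusal, rerouting), but the detection primitive is just this.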
The SAEs found tens of millions of interpretable features in Claude 3 Sonnet, vastly more than the number of neurons in the layer being studied. This suggests that the model's representational capacity is far richer than its neuron count implies, consistent with the superposition hypothesis. The abundance of features raises questions about coverage: do these features explain all of the model's behavior, or just the most visible parts?