Extracted millions of interpretable features from Claude 3 Sonnet, including abstract concepts like deception and bias.
Research Paper: Scaled sparse autoencoders to Claude 3 Sonnet, finding tens of millions of interpretable features representing abstract concepts (cities, code patterns, deception, bias). Showed feature steering can modify model outputs in specific, interpretable ways. Landmark result proving interpretability scales to production models.
This paper scaled SAEs from toy one-layer models to Claude 3 Sonnet, a production model with a deep multi-layer architecture. Training SAEs at this scale required significant compute and engineering innovations, but it proved the technique generalizes across model sizes. SAEs now work on production models, not just research demonstrations.
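The core SAE recipe can be sketched compactly. This is a minimal, untrained illustration (the dimensions, weights, and L1 coefficient below are hypothetical, not the paper's actual values): encode a model activation into an overcomplete set of features through a ReLU, decode it back linearly, and train against reconstruction error plus an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: a residual-stream width
# and an overcomplete feature dictionary (real SAEs are far larger).
d_model, d_features = 64, 512

# SAE parameters, randomly initialized here; learned by training in practice.
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation into sparse features, then reconstruct it."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU yields sparse feature activations
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the activation
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(f))
    return recon + sparsity

x = rng.normal(size=d_model)       # stand-in for a captured model activation
f, x_hat = sae_forward(x)
```

Scaling this from a one-layer toy model to a production model changes the engineering (data volume, dictionary size, compute), but not this basic objective.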
Beyond finding concrete features like "mentions of Paris," the SAEs discovered abstract behavioral features including deception, bias, sycophancy, and other high-level concepts. These abstract features are directly relevant to AI safety and alignment. Finding that abstract, safety-relevant behaviors are interpretable as features opens possibilities for monitoring and control.
Activation clamping: pinning a feature's activation to a chosen value to directly influence model behavior. By clamping deception-related features to high values, researchers made Claude behave more deceptively; by clamping them to low values, they reduced deceptive outputs. This demonstrates that features exert causal control over behavior, rather than merely correlating with it.
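Mechanically, clamping amounts to swapping out one feature's decoder contribution in the activation being patched. A minimal sketch, assuming hypothetical SAE decoder weights and feature activations (the index 7 and clamp values below are placeholders, not real deception features):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 512

# Hypothetical SAE decoder and a sample sparse feature vector (untrained).
W_dec = rng.normal(0, 0.1, (d_features, d_model))
f = np.maximum(0.0, rng.normal(size=d_features))  # sparse-ish feature activations
x = f @ W_dec                                     # the activation they reconstruct

def clamp_feature(x, f, feature_idx, clamp_value):
    """Pin one feature to a chosen value and patch the activation.

    Subtracts the feature's current decoder contribution and adds the
    clamped one, leaving every other feature's contribution untouched.
    """
    delta = (clamp_value - f[feature_idx]) * W_dec[feature_idx]
    return x + delta

# Amplify a (hypothetical) feature at index 7 to steer behavior toward it...
x_up = clamp_feature(x, f, feature_idx=7, clamp_value=5.0)
# ...or clamp it to zero to suppress the behavior it represents.
x_off = clamp_feature(x, f, feature_idx=7, clamp_value=0.0)
```

The patched activation is then fed back into the model's forward pass in place of the original, which is what makes the intervention causal rather than observational.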
Features that directly impact whether a model behaves safely or unsafely, such as deception, bias, and sycophancy. The discovery that these features are both interpretable and steerable opens the possibility of real-time safety monitoring: detecting when a model is activating deception-related features during inference and intervening before the output is produced.
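A monitoring loop of this kind could be as simple as thresholding a watchlist of feature activations at inference time. A sketch under stated assumptions: the feature indices and threshold below are entirely hypothetical, standing in for safety-relevant features an SAE might surface.

```python
import numpy as np

# Hypothetical indices of safety-relevant SAE features (placeholders only).
DECEPTION_FEATURES = [12, 340, 517]

def flag_unsafe(f, threshold=4.0):
    """Return the monitored feature indices whose activation exceeds threshold.

    In a real deployment, f would be the SAE feature activations computed
    from the model's residual stream at each forward pass.
    """
    return [i for i in DECEPTION_FEATURES if f[i] > threshold]

f = np.zeros(1024)
f[340] = 6.2              # simulate one deception feature firing strongly
alerts = flag_unsafe(f)   # -> [340]
```

A production monitor would need calibrated thresholds and a policy for what "taking action" means (logging, refusal, rerouting), but the detection primitive is just this.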
The SAEs found tens of millions of interpretable features in Claude 3 Sonnet, vastly more than the number of neurons in the layer being studied. This suggests that the model's representational capacity is far richer than its neuron count implies, consistent with the superposition hypothesis. The abundance of features raises questions about coverage: do these features explain all of the model's behavior, or just the most visible parts?