Safety & Alignment · Anthropic · May 2024

15. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Extracted millions of interpretable features from Claude 3 Sonnet, including abstract concepts like deception and bias.

Summary

Scaled sparse autoencoders to Claude 3 Sonnet, finding tens of millions of interpretable features spanning concrete concepts (cities, code patterns) and abstract ones (deception, bias). Showed that feature steering can modify model outputs in specific, interpretable ways. A landmark result proving interpretability scales to production models.

Key Concepts

Sparse Autoencoders at Production Scale

This paper scaled SAEs from toy one-layer models to Claude 3 Sonnet, a medium-sized production model with a complex multi-layer architecture. Training SAEs at this scale required significant compute and careful engineering of the training setup, but it proved the technique generalizes across model sizes: SAEs now work on production models, not just research demonstrations.
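As a rough sketch of the core technique, here is a minimal sparse autoencoder in PyTorch. The class name, dimensions, and the L1 coefficient are illustrative assumptions, not the paper's actual training configuration; the essential ingredients are a wide encoder (many more features than input dimensions) and a sparsity penalty that keeps most features silent for any given activation.

```python
# Minimal sparse autoencoder (SAE) sketch. All names and hyperparameters
# here are illustrative, not the paper's actual setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder maps model activations into a much wider feature space
        # (n_features >> d_model), where each dimension is one feature.
        self.encoder = nn.Linear(d_model, n_features)
        # Decoder reconstructs the original activation from the features.
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model;
    # the L1 term pushes most feature activations to exactly zero.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```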

Abstract Feature Discovery

Beyond finding concrete features like "mentions of Paris," the SAEs discovered abstract behavioral features including deception, bias, sycophancy, and other high-level concepts. That these safety-relevant behaviors surface as interpretable features is directly relevant to AI safety and alignment, and opens possibilities for monitoring and control.
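One standard way to attach meaning to a learned feature (a hedged sketch, not necessarily the paper's exact pipeline) is to collect the dataset examples on which it activates most strongly; a feature that fires consistently on deceptive dialogue earns the label "deception." The helper below assumes precomputed SAE activations.

```python
import torch

def top_activating_examples(feature_acts: torch.Tensor, texts: list[str],
                            feature_idx: int, k: int = 10):
    # feature_acts: (n_examples, n_features) activations from the SAE encoder.
    # Returns the k texts on which the chosen feature fires hardest.
    scores = feature_acts[:, feature_idx]
    top = scores.argsort(descending=True)[:k]
    return [(texts[i], float(scores[i])) for i in top.tolist()]
```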

Feature Steering

Feature steering works through activation clamping: modulating the strength of feature activations to directly influence model behavior. By amplifying deception-related features, researchers made Claude behave more deceptively; by clamping them to zero, they reduced deceptive outputs. This demonstrates that features exert causal control over behavior, not just correlational relationships.
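A hedged sketch of steering, reusing the SparseAutoencoder above: encode an activation into features, override one feature's value, and decode. The `steer` helper and its arguments are hypothetical; in the paper, the modified reconstruction is substituted back into the residual stream during a forward pass, so the intervention changes what the model actually generates.

```python
import torch

def steer(sae: "SparseAutoencoder", x: torch.Tensor,
          feature_idx: int, clamp_value: float) -> torch.Tensor:
    # Decompose the activation, clamp one feature, and reconstruct.
    # clamp_value = 0.0 suppresses the feature; a large positive value
    # amplifies it (e.g., a multiple of its maximum observed activation).
    with torch.no_grad():
        features = torch.relu(sae.encoder(x))
        features[..., feature_idx] = clamp_value
        return sae.decoder(features)
```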

Safety-Relevant Features

Features that directly impact whether a model behaves safely or unsafely, such as deception, bias, and sycophancy. The discovery that these features are interpretable and steerable opens the possibility of building real-time safety monitoring — detecting when a model is activating deceptive features during inference and taking action.
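A toy monitor along those lines, assuming SAE feature activations are available for each generated token. The feature indices and threshold are hypothetical placeholders; a real system would need validated mappings from features to behaviors and calibrated thresholds.

```python
import torch

# Hypothetical indices of safety-relevant features in the SAE dictionary.
SAFETY_FEATURES = {"deception": 123, "sycophancy": 456}

def flag_unsafe_tokens(features: torch.Tensor, threshold: float = 5.0):
    # features: (seq_len, n_features) SAE activations for one generation.
    # Returns, per safety feature, the token positions where it exceeded
    # the threshold, so downstream logic can intervene or log.
    flags = {}
    for name, idx in SAFETY_FEATURES.items():
        positions = (features[:, idx] > threshold).nonzero().flatten().tolist()
        if positions:
            flags[name] = positions
    return flags
```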

Millions of Interpretable Features

The SAEs found tens of millions of interpretable features in Claude 3 Sonnet, far more features than the dimensionality of the activations they decompose. This suggests the model stores concepts in superposition, so its representational capacity is far richer than its neuron count implies. The abundance of features also raises a question of coverage: do these features explain all of the model's behavior, or just the most legible parts?

Connections

Influenced by
9. Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
Oct 2023
Influences
23. Alignment Faking in Large Language Models
Dec 2024
27. Tracing the Thoughts of a Large Language Model (Circuit Tracing)
Mar 2025