Used sparse autoencoders to decompose neural network activations into interpretable features.
Research Paper: Used sparse autoencoders to extract monosemantic features from a one-layer transformer model. Showed that dictionary learning can decompose neural network activations into interpretable concepts. Foundational work for the later Scaling Monosemanticity breakthrough.
In neural networks, individual neurons often encode multiple distinct concepts simultaneously because the network must represent more concepts than it has neurons. Superposition is the phenomenon in which a layer packs more features than it has dimensions, so a neuron's activation pattern represents a mixture of different semantic meanings. This paper empirically demonstrated superposition in a real model and showed that sparse autoencoders can disentangle these mixed representations into individual, interpretable features.
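A minimal toy sketch of superposition, not taken from the paper: two unrelated concepts (the labels are hypothetical) share directions in a small layer, so a single neuron fires for both and is therefore polysemantic.

```python
import numpy as np

# Two unrelated concepts assigned (hypothetical) directions in a 3-neuron layer.
feature_dirs = np.array([
    [0.8, 0.6, 0.0],   # e.g. "French text"
    [0.6, -0.8, 0.0],  # e.g. "DNA sequences"
])

def layer_activations(feature_strengths):
    # The layer's activations are a superposed mixture of the active features'
    # directions: more concepts than there are neurons to hold them separately.
    return feature_strengths @ feature_dirs

# Neuron 0 fires for BOTH concepts, so reading it alone tells you neither.
act_french = layer_activations(np.array([1.0, 0.0]))
act_dna = layer_activations(np.array([0.0, 1.0]))
```

Because neuron 0 responds to both concepts, no per-neuron inspection can recover the underlying features; that is the problem the sparse autoencoder is meant to solve.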
A technique from signal processing adapted for interpretability that learns a high-dimensional, sparsely activated dictionary of features. The autoencoder expands neural network activations into a larger, overcomplete space where each dimension represents a single concept, using sparsity constraints to ensure only a few features activate at a time. This transforms uninterpretable polysemantic neurons into interpretable monosemantic features.
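A minimal forward-pass sketch of such a sparse autoencoder, with assumed dimensions (16-dim activations expanded to 64 features) and an assumed L1 sparsity coefficient; the real training setup differs in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 16-dim model activations, 64-dim overcomplete feature space.
d_model, d_feat = 16, 64

# Randomly initialized weights for this sketch (trained by gradient descent in practice).
W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))

def sae_forward(x, l1_coeff=1e-3):
    # Encode into the larger feature space; ReLU keeps feature activations nonnegative.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decode back to the original activation space.
    x_hat = f @ W_dec
    # Loss = reconstruction error + L1 penalty that pushes most features to zero.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=(8, d_model))   # a batch of 8 activation vectors
f, x_hat, loss = sae_forward(x)
```

The key design choice is that the hidden layer is larger than the input: sparsity, not dimensionality, does the compression, so each of the many features is free to specialize on one concept.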
Features that have a single, well-defined semantic meaning. Unlike polysemantic neurons that respond to multiple unrelated concepts, monosemantic features extracted by SAEs activate in response to specific, interpretable concepts. This makes the feature's function understandable to humans without requiring complex analysis.
A mathematical framework from signal processing for finding sparse, interpretable representations of high-dimensional data. Dictionary learning solves the inverse problem of reconstructing signals from a sparse combination of learned basis elements. Applied to neural networks, it enables decomposing dense activations into sparse, meaningful features.
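A toy illustration of the dictionary-learning view, with a made-up dictionary: a signal is written as a sparse combination of learned atoms, and only a couple of the available atoms are active.

```python
import numpy as np

# Toy dictionary of 6 atoms (columns) in a 4-dimensional signal space.
D = np.array([
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# A sparse code: only 2 of the 6 atoms contribute to this signal.
s = np.array([0.0, 2.0, 0.0, 0.0, 0.0, 1.5])

# The signal is a sparse linear combination of dictionary atoms: x = D @ s.
x = D @ s
```

Dictionary learning solves the inverse problem: given many signals x, recover both the dictionary D and, for each signal, a sparse code s. In the interpretability setting, x is a vector of dense activations and the recovered atoms are the candidate interpretable features.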