Used sparse autoencoders to decompose neural network activations into interpretable features.
Research Paper: Used sparse autoencoders to extract monosemantic features from a one-layer transformer model. Showed that dictionary learning can decompose neural network activations into interpretable concepts. Foundational work for the later Scaling Monosemanticity breakthrough.
In neural networks, individual neurons often encode multiple distinct concepts simultaneously because the network must represent more concepts than it has neurons. Superposition is the phenomenon in which a layer packs more features than it has dimensions, so a neuron's activation pattern represents a mixture of different semantic meanings. This paper empirically demonstrated superposition in a real model and showed that sparse autoencoders can disentangle these mixed representations into individual, interpretable features.
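A minimal toy sketch of superposition, not taken from the paper: two unrelated concepts (the labels are hypothetical) share directions in a small layer, so a single neuron fires for both and is therefore polysemantic.

```python
import numpy as np

# Two unrelated concepts assigned (hypothetical) directions in a 3-neuron layer.
feature_dirs = np.array([
    [0.8, 0.6, 0.0],   # e.g. "French text"
    [0.6, -0.8, 0.0],  # e.g. "DNA sequences"
])

def layer_activations(feature_strengths):
    # The layer's activations are a superposed mixture of the active features'
    # directions: more concepts than there are neurons to hold them separately.
    return feature_strengths @ feature_dirs

# Neuron 0 fires for BOTH concepts, so reading it alone tells you neither.
act_french = layer_activations(np.array([1.0, 0.0]))
act_dna = layer_activations(np.array([0.0, 1.0]))
```

Because neuron 0 responds to both concepts, no per-neuron inspection can recover the underlying features; that is the problem the sparse autoencoder is meant to solve.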
A technique from signal processing adapted for interpretability that learns a high-dimensional, sparsely activated dictionary of features. The autoencoder expands neural network activations into a larger, overcomplete space where each dimension represents a single concept, using sparsity constraints to ensure only a few features activate at a time. This transforms uninterpretable polysemantic neurons into interpretable monosemantic features.
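A minimal forward-pass sketch of such a sparse autoencoder, with assumed dimensions (16-dim activations expanded to 64 features) and an assumed L1 sparsity coefficient; the real training setup differs in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 16-dim model activations, 64-dim overcomplete feature space.
d_model, d_feat = 16, 64

# Randomly initialized weights for this sketch (trained by gradient descent in practice).
W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))

def sae_forward(x, l1_coeff=1e-3):
    # Encode into the larger feature space; ReLU keeps feature activations nonnegative.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decode back to the original activation space.
    x_hat = f @ W_dec
    # Loss = reconstruction error + L1 penalty that pushes most features to zero.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.normal(size=(8, d_model))   # a batch of 8 activation vectors
f, x_hat, loss = sae_forward(x)
```

The key design choice is that the hidden layer is larger than the input: sparsity, not dimensionality, does the compression, so each of the many features is free to specialize on one concept.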
Features that have a single, well-defined semantic meaning. Unlike polysemantic neurons that respond to multiple unrelated concepts, monosemantic features extracted by SAEs activate in response to specific, interpretable concepts. This makes the feature's function understandable to humans without requiring complex analysis.
A mathematical framework from signal processing for finding sparse, interpretable representations of high-dimensional data. Dictionary learning solves the inverse problem of reconstructing signals from a sparse combination of learned basis elements. Applied to neural networks, it enables decomposing dense activations into sparse, meaningful features.
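A toy illustration of the dictionary-learning view, with a made-up dictionary: a signal is written as a sparse combination of learned atoms, and only a couple of the available atoms are active.

```python
import numpy as np

# Toy dictionary of 6 atoms (columns) in a 4-dimensional signal space.
D = np.array([
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# A sparse code: only 2 of the 6 atoms contribute to this signal.
s = np.array([0.0, 2.0, 0.0, 0.0, 0.0, 1.5])

# The signal is a sparse linear combination of dictionary atoms: x = D @ s.
x = D @ s
```

Dictionary learning solves the inverse problem: given many signals x, recover both the dictionary D and, for each signal, a sparse code s. In the interpretability setting, x is a vector of dense activations and the recovered atoms are the candidate interpretable features.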