The Sequence Opinion #667: The Superposition Hypothesis And How it Changed AI Interpretability
The theory that opened the field of mechanistic interpretability
Mechanistic interpretability—the study of how neural networks internally represent and compute—seeks to illuminate the opaque transformations learned by modern models. At the heart of this pursuit lies a deceptively simple question: what does a neuron mean? Early efforts hoped that neurons, particularly in deeper layers, might correspond to human-interpretable concepts: edges in images, parts of faces, topics in language. But as interpretability research matured, it became clear that many neurons stubbornly resisted such neat categorization. A single neuron might activate for multiple, seemingly unrelated inputs. This phenomenon of polysemanticity complicates efforts to reverse-engineer networks and has led to a key theoretical insight: the superposition hypothesis.
The superposition hypothesis proposes that neural networks are not built around one-neuron-per-feature mappings, but rather represent features as directions in high-dimensional activation spaces. Each neuron contributes to many features, and each feature is spread across many neurons. This leads to overlapping, linearly superimposed representations. Superposition, in this view, is not a flaw or an accident. It is a natural consequence of attempting to store more features than there are neurons to represent them. Neural networks, constrained by finite width yet encouraged by the sparsity with which features occur in data, adopt a compressed representation strategy in which meaning is woven through a shared vector space. This hypothesis explains why neurons are often polysemantic and why interpretability must evolve beyond a neuron-centric view.
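To make the geometry concrete, here is a minimal Python sketch, not taken from the original article, of what it means to store more features than neurons. It assigns 100 hypothetical features random unit directions in a 40-dimensional activation space, superimposes a sparse subset of them by summation, and reads each feature back by projecting onto its direction. The specific dimensions, the random-direction construction, and the variable names are illustrative assumptions, not the hypothesis itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 100   # conceptual features the network "wants" to represent
n_neurons = 40     # dimensionality of the activation space (much smaller)

# Assign each feature a random unit direction in activation space.
# With n_features > n_neurons the directions cannot all be orthogonal,
# only approximately so -- this overlap is the essence of superposition.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

# Superposed activation: the active feature directions are summed together.
h = x @ W                     # shape (n_neurons,)

# Approximate readout: project back onto every feature direction.
recovered = h @ W.T           # shape (n_features,)

# Active features typically score near 1; inactive features pick up only
# small interference terms from the non-orthogonal overlaps.
print("active features:    ", sorted(active.tolist()))
print("top-3 by readout:   ", sorted(np.argsort(-recovered)[:3].tolist()))
print("mean |interference|:", np.abs(np.delete(recovered, active)).mean().round(3))
```

Because only a few features are active at a time, the interference terms stay small on average; with dense activations the same scheme would collapse, which is why the hypothesis ties superposition to sparsity in the data.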