TheSequence

The Sequence Opinion #667: The Superposition Hypothesis And How it Changed AI Interpretability
The theory that opened the field of mechanistic interpretability

Jun 19, 2025
[Header image created using GPT-4o]

Mechanistic interpretability—the study of how neural networks internally represent and compute—seeks to illuminate the opaque transformations learned by modern models. At the heart of this pursuit lies a deceptively simple question: what does a neuron mean? Early efforts hoped that neurons, particularly in deeper layers, might correspond to human-interpretable concepts: edges in images, parts of faces, topics in language. But as interpretability research matured, it became clear that many neurons stubbornly resisted such neat categorization. A single neuron might activate for multiple, seemingly unrelated inputs. This phenomenon of polysemanticity complicates efforts to reverse-engineer networks and has led to a key theoretical insight: the superposition hypothesis.

The superposition hypothesis proposes that neural networks are not built around one-neuron-per-feature mappings, but rather represent features as directions in high-dimensional activation spaces. Each neuron contributes to many features, and each feature is spread across many neurons. This leads to overlapping, linearly superimposed representations. Superposition, in this view, is not a flaw or an accident. It is a natural consequence of attempting to store more features than there are neurons to represent them. Neural networks, constrained by finite width and encouraged by sparsity in data, adopt a compressed representation strategy in which meaning is woven through a shared vector space. This hypothesis explains why neurons are often polysemantic and why interpretability must evolve beyond a neuron-centric view.
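To make the idea concrete, here is a minimal sketch (not from the post itself) of the toy setup usually used to illustrate superposition: more sparse features than neurons, each feature assigned a random, nearly orthogonal direction in activation space. All names and numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 64   # number of underlying "concepts"
n_neurons = 16    # hidden width: far fewer neurons than features
sparsity = 0.05   # probability that any given feature is active

# Assign each feature a random unit direction in neuron-activation space.
# With high probability these directions are nearly orthogonal, so they
# interfere with one another only weakly.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sample a sparse feature vector: most features are off at any given time.
x = rng.random(n_features) * (rng.random(n_features) < sparsity)

# Encode: the hidden activation is a linear superposition of the
# directions of the active features.
h = x @ W          # shape: (n_neurons,)

# Decode: project the activation back onto each feature direction.
x_hat = h @ W.T    # shape: (n_features,)

active = np.flatnonzero(x)
print("active features:", active)
print("true values:    ", np.round(x[active], 3))
print("recovered:      ", np.round(x_hat[active], 3))
# Recovery is approximate: overlap between feature directions adds
# interference noise, which stays small as long as few features are
# active at once -- the sparsity condition the hypothesis relies on.
```

Because every feature direction touches many neurons, inspecting any single neuron shows a mixture of concepts, which is exactly the polysemanticity described above.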

From Monosemanticity to Polysemanticity: A Representational Shift
