The Sequence Engineering #556: Inside Anthropic's New Open Source AI Interpretability Tools
The circuit tracing tools stack represents one of the most important recent releases in AI interpretability.
Steadily and quietly, Anthropic has become the leading AI lab in interpretability. Specifically, Anthropic has been aggressively championing the emerging field of mechanistic interpretability as a way to explain the outputs of frontier models. Recently, they published groundbreaking research on tracing the thoughts of language models. They followed this with an open source release of circuit tracing tools that is the most impressive thing I’ve seen in AI interpretability in a long time, and it is the topic of today’s essay.