The Sequence Knowledge #874: Transformers or Not?

One of the biggest debates in modern AI.

Jun 09, 2026

∙ Paid

💡 AI Concept of the Day: Transformers or Not?

The Transformer is currently the reference architecture for serious AI. Not because it is obviously the most brain-like, elegant, or efficient design, but because it has the best scaling story. You add data, parameters, compute, context length, better training recipes, better post-training, and the model gets better in a surprisingly smooth way. That is rare. In deep learning, many ideas are clever. Few are industrial.

The Transformer’s superpower is attention. Every token can look at every other token and decide what matters. This is an incredibly general operation. It works for language, code, images, audio, video, protein sequences, robotics tokens, and tool traces. The architecture is simple enough to scale, parallel enough to train efficiently, and expressive enough to absorb huge datasets.

But it has an obvious tax: attention is expensive. Full self-attention scales badly with sequence length. In autoregressive generation, the model accumulates a key-value cache, which grows with context. A Transformer remembers by keeping a large, explicit, token-indexed memory. That is powerful, but it is not how you would design every intelligent system from first principles.

So the question is not “are Transformers good?” They are spectacular. The question is: are they the final architecture? Or are they the first truly scalable architecture, soon to be absorbed into something richer?

I think the second view is more likely.

TheSequence

The Sequence Knowledge #874: Transformers or Not?

One of the biggest debates in modern AI.

💡 AI Concept of the Day: Transformers or Not?

The Landscape of Alternatives

This post is for paid subscribers