Edge 420: Inside FlashAttention-3, The Algorithm Pushing the New Wave of Transformers


The new algorithm takes full advantage of the capabilities of H100 GPUs.

Aug 08, 2024

[Image: Created Using Ideogram]

Few algorithms have had as much impact on the recent generation of transformer architectures as FlashAttention. Originally developed by a team including the renowned Tri Dao, now at Princeton University, FlashAttention and its successor FlashAttention-2 improved the performance of attention mechanisms on GPUs by minimizing reads and writes between slow high-bandwidth memory (HBM) and fast on-chip SRAM. Almost immediately after the original publication, FlashAttention was adopted across the new generation of transformers. There were few complaints about FlashAttention, but one of them was that it could not take full advantage of new hardware architectures: FlashAttention-2 achieves only about 35% utilization of the maximum FLOPs on H100 GPUs.
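To make the memory-saving idea concrete, here is a minimal NumPy sketch of tiled attention with an online softmax, the core trick FlashAttention builds on. It is an illustration of the principle rather than the actual fused CUDA kernel; the function names and block size below are our own choices, not part of the library.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full (N x N) score matrix,
    # which is what drives the memory traffic FlashAttention avoids.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block_size=64):
    # Processes K/V in blocks while maintaining running softmax statistics
    # (the "online softmax" trick), so the full score matrix is never stored.
    N, d = Q.shape
    O = np.zeros((N, d))
    row_max = np.full((N, 1), -np.inf)   # running max of scores per query row
    row_sum = np.zeros((N, 1))           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                        # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        P = np.exp(S - new_max)
        correction = np.exp(row_max - new_max)        # rescale earlier partial results
        row_sum = row_sum * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        row_max = new_max
    return O / row_sum

# Quick check that the tiled version matches the naive one.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V), atol=1e-6)
```

The key design point is that the N x N score matrix is only ever materialized one block at a time, which is what cuts the read-write traffic described above.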

But now we have a new version.
