FlashAttention: Fast Transformer training with long sequences

Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: doubling the sequence length would quadruple the runtime and memory requirements.

FlashAttention is a new algorithm to speed up attention and reduce its memory footprint—without any approximation.

Since FlashAttention was released 6 months ago, it has been adopted by many organizations and research labs to speed up their training & inference (see this page for a partial list).

For the last 2 months I’ve been collaborating with Adept as a part-time research fellow and we’ve been developing some improvements to FlashAttention to make it even better! In this post, we describe one key improvement that we’re particularly excited about: making FlashAttention fast for long sequences to enable training large language models with longer context.

As an example, for sequence length 8k, FlashAttention is now up to 2.7x faster than a standard PyTorch implementation, and up to 2.2x faster than the optimized implementation from Megatron-LM, even at small batch size. As we will see, training with longer context yields higher quality models. As we’ve mentioned before, we believe that modeling longer sequences could help us take the next leap in AI, and FlashAttention is one component to scale Transformers to longer context. At Adept, we’ve been training large Transformers (ACT-1) to take actions with the goal of building an AI teammate. Understanding webpages, software tool interfaces, and multi-turn user interactions can require contexts that far exceed the common 2k standard.

Motivation: Long sequences

Scaling up the context length of Transformers is a challenge, since the multihead attention layer at their heart has runtime and memory requirements that are quadratic in the input sequence length. Ideally, we would like to go beyond the standard 2k sequence length limit to train models to understand books, high-resolution images, webpages, multi-turn user interactions, and long-form videos.
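
To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch of how much memory a standard implementation would need just to hold the attention score matrix. The function name and the defaults (batch size 1, 12 heads, 2 bytes per element as in fp16) are assumptions chosen for illustration, not measurements.

```python
def attention_scores_bytes(seq_len, batch_size=1, num_heads=12, bytes_per_element=2):
    """Bytes needed to materialize a full seq_len x seq_len score matrix
    for every (batch, head) pair. All defaults are illustrative assumptions."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for n in (2048, 4096, 8192):
    gib = attention_scores_bytes(n) / 2**30
    print(f"seq_len={n}: {gib:.2f} GiB of attention scores")
```

Each doubling of the sequence length multiplies this number by four, and that is for a single layer before counting anything saved for the backward pass.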

FlashAttention is an algorithm that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. This works great for most cases, but it was not optimized for the case of super long sequences (where batch sizes and numbers of heads are small) due to insufficient parallelism. If one trains large Transformers on long sequences with modern parallelism techniques (data parallel, pipeline parallel, tensor parallel) to split the data and model among many GPUs, the batch size can get very small (e.g. batch size of 1 with pipeline parallelism, and number of heads around 8-12 with tensor parallelism). This is the case we would like to optimize for.
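
Here is a rough sketch of why small batch sizes and head counts hurt. It assumes, purely for illustration, that the kernel hands out one unit of work per (batch, head) pair and that the GPU is an NVIDIA A100 with 108 streaming multiprocessors (SMs). The function name is hypothetical, and the model ignores scheduling details such as multiple waves of work.

```python
A100_NUM_SMS = 108  # SM count on an NVIDIA A100; the target GPU is an assumption


def sm_utilization(batch_size, num_heads, num_sms=A100_NUM_SMS):
    """Fraction of SMs that get any work, assuming one unit of work
    per (batch, head) pair. This is a simplified, illustrative model."""
    work_units = batch_size * num_heads
    return min(1.0, work_units / num_sms)


for batch_size, num_heads in [(32, 16), (4, 12), (1, 12), (1, 8)]:
    frac = sm_utilization(batch_size, num_heads)
    print(f"batch={batch_size}, heads={num_heads}: {frac:.0%} of SMs busy")
```

Under these assumptions, a batch size of 1 with 8 to 12 heads leaves most of the GPU idle no matter how long the sequence is.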

Attention parallelism to optimize for long sequences

For each attention head, to reduce memory reads/writes, FlashAttention uses classical tiling techniques to load blocks of query, key, and value from GPU HBM (its main memory) to SRAM (its fast cache), compute attention with respect to that block, and write back the output to HBM. This reduction in memory reads/writes brings significant speedup (2-4x) in most cases.
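
The tiling idea can be sketched in plain Python. This is a toy illustration on nested lists, not the CUDA kernel: it walks over the keys and values one block at a time and keeps only a running max, a running softmax denominator, and an unnormalized output for each query, using the standard online-softmax rescaling so the result stays exact. For simplicity each query row is treated as its own block, and the function names and block size are assumptions made for this example.

```python
import math
import random


def naive_attention(q, k, v):
    """Reference: build the full row of scores for each query, then softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Visit keys/values block by block; never hold a full row of scores."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        running_max = -math.inf
        denom = 0.0
        acc = [0.0] * len(v[0])  # unnormalized output for this query
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous max.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            denom = denom * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


random.seed(0)
q = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
k = [[random.gauss(0, 1) for _ in range(4)] for _ in range(7)]
v = [[random.gauss(0, 1) for _ in range(3)] for _ in range(7)]

reference = naive_attention(q, k, v)
tiled = tiled_attention(q, k, v, block_size=2)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two functions should agree up to floating-point rounding, which is the sense in which tiling involves no approximation. On a GPU, the per-block working set is what would live in SRAM, while the full inputs and outputs stay in HBM.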