FlashAttention: Fast Transformer training with long sequences
Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: its runtime and memory grow quadratically with sequence length, so doubling the sequence length quadruples both. FlashAttention is a new algorithm that speeds up attention and reduces its memory footprint, without any approximation. Since FlashAttention was released 6 months ago, it has been adopted by many organizations and research labs to speed up their training and inference (see this page for a partial list)...
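To see where the quadratic cost comes from, here is a minimal sketch (in PyTorch, not the blog's actual code) of standard attention: the illustrative `naive_attention` function materializes the full seq_len × seq_len score matrix, which is exactly what FlashAttention avoids. The batch/head sizes in the comment are assumptions for the arithmetic only.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # The (seq_len x seq_len) score matrix below is what makes standard
    # attention quadratic in sequence length, in both compute and memory.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    probs = scores.softmax(dim=-1)
    return probs @ v

# Memory of the score matrix alone, in fp16, assuming batch=8, heads=16:
#   seq_len = 2048 -> 8 * 16 * 2048 * 2048 * 2 bytes ≈ 1.07 GB
#   seq_len = 4096 -> ≈ 4.3 GB   (doubling the length quadruples it)
```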