FlashAttention: Fast Transformer training with long sequences
Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: its runtime and memory grow quadratically with sequence length, so doubling the sequence length quadruples both. FlashAttention is a new algorithm that speeds up attention and reduces its memory footprint, without any approximation. Since FlashAttention was released 6 months ago, it has been adopted by many organizations and research labs to speed up their training and inference (see this page for a partial list)...
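To see where the quadratic cost comes from, here is a minimal sketch (in PyTorch, not the blog's actual code) of standard attention: the illustrative `naive_attention` function materializes the full seq_len × seq_len score matrix, which is exactly what FlashAttention avoids. The batch/head sizes in the comment are assumptions for the arithmetic only.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # The (seq_len x seq_len) score matrix below is what makes standard
    # attention quadratic in sequence length, in both compute and memory.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    probs = scores.softmax(dim=-1)
    return probs @ v

# Memory of the score matrix alone, in fp16, assuming batch=8, heads=16:
#   seq_len = 2048 -> 8 * 16 * 2048 * 2048 * 2 bytes ≈ 1.07 GB
#   seq_len = 4096 -> ≈ 4.3 GB   (doubling the length quadruples it)
```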