The image depicts a man attending a presentation in a professional setting. The room contains computers and laptops, and the man is seated facing a projection screen next to a whiteboard. Several other people in the room are likewise focused on the presentation, and the man appears to be listening attentively to the speaker.

Text transcribed from the image:

FlashAttention-3: Optimizing FlashAttention for H100 GPU
Jay Shah*, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

1. New instructions on H100:
   - WGMMA: higher throughput
   - TMA: faster loading between gmem <-> smem, saves registers
2. Asynchrony
   - Overlap GEMM and softmax (see the first sketch below)
   - Builds on asynchronous WGMMA, TMA, transaction barriers
   - Inter-warpgroup overlapping: warp-specialization, pingpong scheduling
   - Intra-warpgroup overlapping
3. Low-precision
   - FP8: incoherent processing to reduce quantization errors (see the second sketch below)
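To ground item 2: the work being overlapped is the blocked online-softmax loop that FlashAttention computes. Each key/value block requires two GEMMs (S = QKᵀ and the P·V accumulation) plus softmax bookkeeping (running row max, exponentials, rescaling); FA3 issues the GEMMs asynchronously on the tensor cores (WGMMA) so they run concurrently with the softmax work of neighboring iterations or warpgroups. The NumPy sketch below shows only the algorithm being pipelined, not the hardware scheduling itself; `flash_attention_forward` and `block_size` are illustrative names, not from the slide or paper.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=64):
    """Blocked attention with online softmax (single head, no masking).

    Per key/value block: GEMM #1 (S = Q Kb^T), softmax bookkeeping
    (max, exp, rescale), GEMM #2 (O += P Vb). FA3 overlaps the two
    GEMMs with the softmax steps; here they simply run in sequence.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((seq_len, d))
    row_max = np.full(seq_len, -np.inf)  # running max per query row
    row_sum = np.zeros(seq_len)          # running softmax denominator

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                  # GEMM #1 (tensor cores in FA3)
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])        # softmax exponentials
        correction = np.exp(row_max - new_max)  # rescale old partial results
        row_sum = row_sum * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb    # GEMM #2
        row_max = new_max

    return O / row_sum[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    S = Q @ K.T / np.sqrt(64)                   # reference: exact attention
    ref = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(flash_attention_forward(Q, K, V), ref)
```

The online-softmax rescaling is what makes the blocking legal: each block's partial output is corrected by `exp(row_max - new_max)` when a larger row maximum appears, so no block ever needs to wait for the full row of scores.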
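For item 3, incoherent processing rotates Q and K by an orthogonal matrix M before quantizing to FP8: since M Mᵀ = I, (QM)(KM)ᵀ = QKᵀ exactly, while the rotation spreads outlier features across all channels so the quantization grid fits the data better. FA3 uses a fast Hadamard transform with random signs; the sketch below substitutes a dense random orthogonal matrix and a crude uniform quantizer as stand-ins for real FP8 rounding, purely to demonstrate the error reduction. All names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
Q = rng.standard_normal((256, d))
K = rng.standard_normal((256, d))
Q[:, 0] *= 50.0  # inject an outlier channel: the hard case for low precision

def quantize(x, levels=127):
    # Symmetric per-tensor uniform quantization; a crude stand-in for
    # FP8 rounding (FP8 spacing is non-uniform, but the outlier problem
    # with a shared per-tensor scale is analogous).
    s = np.abs(x).max() / levels
    return np.round(x / s) * s

# Random orthogonal matrix via QR of a Gaussian. FA3 uses a random-sign
# Hadamard transform instead, which applies in O(d log d).
M, _ = np.linalg.qr(rng.standard_normal((d, d)))

exact = Q @ K.T  # (Q M)(K M)^T equals this exactly, since M M^T = I
err_plain = np.abs(quantize(Q) @ quantize(K).T - exact).mean()
err_rot = np.abs(quantize(Q @ M) @ quantize(K @ M).T - exact).mean()
print(f"mean |error| without rotation: {err_plain:.4f}")
print(f"mean |error| with rotation:    {err_rot:.4f}")  # expect several x smaller
```

The mechanism: the outlier channel forces a coarse per-tensor quantization step for every entry of Q; after rotation its energy is spread over all d channels, the maximum magnitude drops, the step shrinks, and the scores are unchanged because the rotation cancels inside QKᵀ.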