Attendees at a technical conference session watch a presentation on "Snap Video" projected on a screen. The slide covers "Transformer-based video diffusion models," with diagrams outlining different computational paradigms for video processing and the Snap Video FIT architecture. The attendees are seated on the carpeted floor of a dimly lit room, closely engaged with the content being presented. Text transcribed from the image: Snap Video; Transformer-based video diffusion models; Separable Spatiotemporal Modeling; "Workout session"; Text; Joint S.Tock; Pixels to patches; Our Joint Spatiotemporal Model; (a) Computational Paradigms for Videos; Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024; XA+FF; FIT Block; XA+FF; Patch Tokens; (b) Snap Video FIT Architecture