Participants sit attentively in a conference room, watching a presentation titled "Snap Video: Transformer-based Video Diffusion Models." The slide on the screen shows diagrams and flowcharts of computational paradigms for video and of the Snap Video FIT architecture. A projector on a table to the left casts the presentation onto a large screen. The audience appears engaged, with some individuals seated on the floor taking notes or simply observing. The room is well lit, with a modern ceiling design and a carpeted floor. The atmosphere reflects a professional and educational setting, likely part of a larger workshop or seminar on advanced technological concepts. Text transcribed from the image: Snap Video: Transformer-based video diffusion models; Sk; Separable Spatiotemporal Model; "Workout session"; Text Fa; Joint S.Tock; Pixels to patches; Our Joint Spatiotemporal Model; (a) Computational Paradigms for Videos; Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024; XA+FF; FIT Block; XA+FF; Patch Tokens; (b) Snap Video FIT Architecture.