A presentation on "Snap Video: Transformer-Based Video Diffusion Models" is shown on a large screen at a conference. The slide features two main sections: "(a) Computational Paradigms for Videos" and "(b) Snap Video FIT Architecture." The left part of the slide contains diagrams comparing computational approaches for video synthesis: a separable spatiotemporal model, which alternates spatial and temporal blocks, versus the Snap Video approach, which applies learnable compression, a joint spatiotemporal block, and learnable decompression. The right part details the Snap Video FIT architecture as a sequence of stages, including text encoding, joint spatiotemporal (S.T.) blocks, compression, and decompression, illustrated through detailed, colorful flowcharts. Attendees, seated on the floor, are attentively watching the presentation. The slide includes the citation: Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024. The room has a professional setting with a mounted projector screen and conference facilities in the background.

Text transcribed from the image (cleaned): Snap Video. Transformer-based video diffusion models. (a) Computational Paradigms for Videos: two stacked "Separable Block" units (Spatial, Temporal each), labeled "Separable Spatiotemporal Model"; versus Learnable Compression, Joint Spatiotemporal Block, Learnable Decompression, labeled "Our Joint Spatiotemporal Model." (b) Snap Video FIT Architecture: Text Enc. (prompt: "Workout session"); conditioning on Noise Std. σ, Framerate v, and Resolution; Pixels to patches; Patch Tokens; XAttn+FF; Latent Tok.; FIT Block (XAttn+FF) repeated N×; XAttn+FF; Patches to pixels. Citation: Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.
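The compression/decompression idea described on the slide can be illustrated with a toy sketch: a small set of latent tokens reads from the many spatiotemporal patch tokens via cross-attention (compression), the heavy joint computation runs on the short latent sequence, and the patch tokens then read the result back (decompression). This is a minimal, hypothetical illustration in plain Python, not the Snap Video implementation; the single-head attention without learned Q/K/V projections, the token counts, and all values are assumptions made for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values, d):
    """Each query token attends over all key/value tokens.
    Keys double as values here for simplicity; a real model
    would use separate learned K/V projections."""
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        weights = softmax([dot(q, kv) * scale for kv in keys_values])
        out.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                    for i in range(d)])
    return out

# Toy setup: 8 spatiotemporal patch tokens compressed to 2 latent tokens.
d = 4
patch_tokens = [[((i + j) % 3) * 0.5 for j in range(d)] for i in range(8)]
latent_tokens = [[0.1 * j for j in range(d)], [0.2, 0.0, 0.1, 0.3]]

# "Compression": few latent queries read from many patch tokens.
compressed = cross_attention(latent_tokens, patch_tokens, d)
# ...joint spatiotemporal transformer blocks would operate on `compressed`...
# "Decompression": patch tokens read the processed latents back.
decompressed = cross_attention(patch_tokens, compressed, d)

print(len(compressed), len(decompressed))  # 2 8
```

The cost saving comes from running the repeated joint blocks on the short latent sequence (here length 2) rather than on the full patch sequence (here length 8), with only the two cross-attention hops touching all patch tokens.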