A presentation on "Snap Video: Transformer-Based Video Diffusion Models" is shown on a large screen at a conference. The slide features two main sections: "(a) Computational Paradigms for Videos" and "(b) Snap Video FIT Architecture." The left part of the slide contains diagrams comparing computational approaches for video synthesis: a separable spatiotemporal model, which alternates spatial and temporal blocks, versus the Snap Video approach, which applies learnable compression, a joint spatiotemporal block, and learnable decompression. The right part details the Snap Video FIT architecture as a sequence of stages, including text encoding, joint spatiotemporal (S.T.) blocks, compression, and decompression, illustrated through detailed, colorful flowcharts. Attendees, seated on the floor, are attentively watching the presentation. The slide includes the citation: Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024. The room has a professional setting with a mounted projector screen and conference facilities in the background.

Text transcribed from the image (cleaned): Snap Video. Transformer-based video diffusion models. (a) Computational Paradigms for Videos: two stacked "Separable Block" units (Spatial, Temporal each), labeled "Separable Spatiotemporal Model"; versus Learnable Compression, Joint Spatiotemporal Block, Learnable Decompression, labeled "Our Joint Spatiotemporal Model." (b) Snap Video FIT Architecture: Text Enc. (prompt: "Workout session"); conditioning on Noise Std. σ, Framerate v, and Resolution; Pixels to patches; Patch Tokens; XAttn+FF; Latent Tok.; FIT Block (XAttn+FF) repeated N×; XAttn+FF; Patches to pixels. Citation: Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.
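The compression/decompression idea described on the slide can be illustrated with a toy sketch: a small set of latent tokens reads from the many spatiotemporal patch tokens via cross-attention (compression), the heavy joint computation runs on the short latent sequence, and the patch tokens then read the result back (decompression). This is a minimal, hypothetical illustration in plain Python, not the Snap Video implementation; the single-head attention without learned Q/K/V projections, the token counts, and all values are assumptions made for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values, d):
    """Each query token attends over all key/value tokens.
    Keys double as values here for simplicity; a real model
    would use separate learned K/V projections."""
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        weights = softmax([dot(q, kv) * scale for kv in keys_values])
        out.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                    for i in range(d)])
    return out

# Toy setup: 8 spatiotemporal patch tokens compressed to 2 latent tokens.
d = 4
patch_tokens = [[((i + j) % 3) * 0.5 for j in range(d)] for i in range(8)]
latent_tokens = [[0.1 * j for j in range(d)], [0.2, 0.0, 0.1, 0.3]]

# "Compression": few latent queries read from many patch tokens.
compressed = cross_attention(latent_tokens, patch_tokens, d)
# ...joint spatiotemporal transformer blocks would operate on `compressed`...
# "Decompression": patch tokens read the processed latents back.
decompressed = cross_attention(patch_tokens, compressed, d)

print(len(compressed), len(decompressed))  # 2 8
```

The cost saving comes from running the repeated joint blocks on the short latent sequence (here length 2) rather than on the full patch sequence (here length 8), with only the two cross-attention hops touching all patch tokens.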