Participants sit attentively in a conference room, focusing on a presentation about "Snap Video: Transformer-based Video Diffusion Models." The slide on the screen shows detailed graphics and flowcharts explaining computational paradigms for video and the Snap Video FIT architecture. A projector on a table to the left casts the presentation onto a large screen. The audience appears engaged, with some individuals seated on the floor taking notes or simply observing. The room is well-lit, with a modern ceiling design and a carpeted floor. The atmosphere reflects a professional and educational setting, likely part of a larger workshop or seminar on advanced technological concepts.
Text transcribed from the image:
Snap Video
Transformer-based video diffusion models
Separable Spatiotemporal Modeling
"Workout session"
Text Fa
Joint S.Tock
Pixels to patches
Our Joint Spatiotemporal Model
(a) Computational Paradigms for Videos
Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.
XA+FF
FIT Block
XA+FF
Patch Tokens
(b) Snap Video FIT Architecture