Attendees at a technical conference session watch a presentation on "Snap Video" projected on a screen. The slide, titled "Transformer-based video diffusion models," shows diagrams of different computational paradigms for video processing and the Snap Video FIT architecture. The attendees are seated on the carpeted floor of a dimly lit room, focused on the presentation.
Text transcribed from the image:
Snap Video
Transformer-based video diffusion models
Sk
Separable Spatiotemporal Modeling
"Workout session"
Text Fa
Joint S.T. Block
Pixels to patches
Our Joint Spatiotemporal Model
(a) Computational Paradigms for Videos
Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.
XA+FF
FIT Block
XA+FF
Patch Tokens
(b) Snap Video FIT Architecture