In the image, a group of people are gathered in a television studio, with some of them sitting in front of a screen and others standing. On the screen, a presentation slide is being shown, featuring architecture diagrams and text. The slide is titled "Snap Video" and covers transformer-based video diffusion models. The people in the image seem to be engaged in the presentation, as they are attentively looking at the screen. The television studio setting suggests that this event is being broadcast or recorded for distribution.
Text transcribed from the image:
Snap Video
Transformer-based video diffusion models
Separable Block
Spatial
Separable Block
Spatial
Temporal
Temporal
Learnable Compression
Joint Spatiotemporal Block
Learnable Decompression
Text Enc.
Joint S.T. Block
Compression
Decompression
"Workout session"
xa
Noise Std. σ
Framerate V →ovr
Resolution 7
XAttn+FF
Latent Tok.
FIT Block
Nx
XAttn+FF
XAttn+FF
Nx
Separable Spatiotemporal Model
Our Joint Spatiotemporal Model
Pixels to patches
Patch Tokens
Patches to pixels
(a) Computational Paradigms for Videos
(b) Snap Video FIT Architecture
Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.