A detailed caption for the image could be:

"Attendees at a conference sit attentively facing a presentation screen that displays a slide titled 'Snap Video: Transformer-based video diffusion models.' The slide has two panels: (a) Computational Paradigms for Videos, which contrasts a separable spatiotemporal model with a joint spatiotemporal model using a sequence of images depicting various activities, and (b) Snap Video FIT Architecture, a block diagram of the text-to-video synthesis process drawn with labeled blocks and arrows. The slide credits the diagram to Menapace et al., 'Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis,' CVPR 2024."
Text transcribed from the image:
Snap Video
Transformer-based video diffusion models
Separable Block
Spatial
Separable Block
Spatial
Temporal
Temporal
Learnable Compression
Joint Spatiotemporal
Block
Learnable Decompression
Text Enc.
Joint S.T. Block
Compression
Decompression
"Workout session"
xa
Noise Std. σ
Framerate V →ovr
Resolution 7
XAttn+FF
Latent Tok.
FIT Block
Nx
XAttn+FF
XAttn+FF
Nx
Separable Spatiotemporal Model
Our Joint Spatiotemporal Model
Pixels to patches
Patch Tokens
Patches to pixels
(a) Computational Paradigms for Videos
(b) Snap Video FIT Architecture
Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.