A detailed caption for the image could be: "Conference attendees sit facing a presentation screen showing a slide titled 'Snap Video: Transformer-based video diffusion models.' The slide has two panels: (a) Computational Paradigms for Videos, which contrasts a separable spatiotemporal model with the authors' joint spatiotemporal model using a sequence of frames depicting various activities, and (b) Snap Video FIT Architecture, which lays out the text-to-video synthesis pipeline as labeled blocks connected by arrows. The slide cites Menapace et al., CVPR 2024. The attendees appear attentive and focused on the slide."

Text transcribed from the image: Snap Video: Transformer-based video diffusion models. Separable Block (Spatial, Temporal), shown twice; Learnable Compression; Joint Spatiotemporal Block; Learnable Decompression; Text Enc.; Joint S.T. Block; Compression; Decompression; "Workout session"; Noise Std. σ; Framerate; Resolution; XAttn+FF; Latent Tok.; FIT Block; N×; XAttn+FF; XAttn+FF; N×; Separable Spatiotemporal Model; Our Joint Spatiotemporal Model; Pixels to patches; Patch Tokens; Patches to pixels. (a) Computational Paradigms for Videos. (b) Snap Video FIT Architecture.

Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.