In the image, a group of people are gathered in what appears to be a television studio; some are seated in front of a screen and others are standing. The screen shows a presentation slide with diagrams and text. According to the transcribed text, the slide is titled "Snap Video" and covers transformer-based video diffusion models. The people appear engaged, looking attentively at the screen. The studio setting suggests that the event is being broadcast or recorded for distribution.

Text transcribed from the image:

Snap Video
Transformer-based video diffusion models

(a) Computational Paradigms for Videos
Separable Spatiotemporal Model: Separable Block (Spatial, Temporal), repeated.
Our Joint Spatiotemporal Model: Learnable Compression, Joint Spatiotemporal Block (Joint S.T. Block), Learnable Decompression.

(b) Snap Video FIT Architecture
Inputs: Text Enc. ("Workout session"), Noise Std. σ, Framerate, Resolution.
Components: Pixels to patches, Patch Tokens, Latent Tok., FIT Block (N×), XAttn+FF, Patches to pixels.

Menapace et al., "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," CVPR 2024.