A detailed caption for the image could be: "Attendees listen to a conference presentation on 'Lumiere: A space-time diffusion model for video generation'. The slide on display illustrates the model's architecture in three panels: (a) the Space-Time UNet (STUNet), (b) the convolution-based inflation block, and (c) the attention-based inflation block. A legend explains the diagram symbols for spatial resizing, temporal resizing, skip connections, conv-based inflation, and attention-based inflation. Panel (b) shows blocks for the pretrained spatial layer(s), a 2D convolution, norm + activation, a 1D convolution, norm + activation, and a linear projection."
Text transcribed from the image:
Lumiere: A space-time diffusion model for video generation
(a) Space-Time UNet (STUNet): T × H × W × D
Legend: Spatial Resizing, Temporal Resizing, Skip Connection, Conv-based Inflation, Attention-based Inflation
l et al., "Lumiere: A space-time diffusion model for video generation," arXiv 2024.
(b) Convolution-based Inflation Block: Pretrained Spatial Layer(s), 2D Convolution, Norm + activation, 1D Convolution, Norm + activation, Linear Projection
(c) Attention-based Inflation: Pretrained Spati IDA
Copyright Linca
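The convolution-based inflation block listed on the slide can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the components are applied in the order the slide lists them, that "norm + activation" means layer normalization over channels followed by ReLU, and that all convolutions use stride 1 with zero padding. The function names, tensor sizes, and the identity stand-in for the pretrained spatial layer are hypothetical.

```python
import numpy as np

def conv2d_per_frame(x, w):
    """Spatial 2D convolution applied to each frame independently.
    x: (T, H, W, C_in); w: (k, k, C_in, C_out) with odd k; zero 'same' padding."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (0, 0)))
    T, H, W, _ = x.shape
    out = np.zeros((T, H, W, w.shape[-1]))
    for i in range(k):
        for j in range(k):
            # Shifted spatial window times the (C_in, C_out) weights of that tap.
            out += xp[:, i:i + H, j:j + W, :] @ w[i, j]
    return out

def conv1d_over_time(x, w):
    """Temporal 1D convolution applied at each pixel independently.
    x: (T, H, W, C_in); w: (k, C_in, C_out) with odd k; zero 'same' padding."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (0, 0), (0, 0), (0, 0)))
    T = x.shape[0]
    out = np.zeros(x.shape[:-1] + (w.shape[-1],))
    for i in range(k):
        # Shifted temporal window times the (C_in, C_out) weights of that tap.
        out += xp[i:i + T] @ w[i]
    return out

def norm_act(x, eps=1e-5):
    """Assumed 'norm + activation': layer norm over channels, then ReLU."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return np.maximum(0.0, (x - mean) / np.sqrt(var + eps))

def conv_inflation_block(x, spatial_layer, w2d, w1d, w_proj):
    """Components in the order listed on the slide (the wiring is an assumption)."""
    h = spatial_layer(x)                      # pretrained spatial layer(s), per frame
    h = norm_act(conv2d_per_frame(h, w2d))    # 2D convolution, norm + activation
    h = norm_act(conv1d_over_time(h, w1d))    # 1D convolution, norm + activation
    return h @ w_proj                         # linear projection over channels

# Hypothetical sizes: 4 frames of 6 x 6 pixels with 3 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6, 3))
w2d = rng.standard_normal((3, 3, 3, 5))
w1d = rng.standard_normal((3, 5, 5))
w_proj = rng.standard_normal((5, 3))
identity_spatial = lambda v: v  # placeholder for the pretrained spatial layer(s)

out = conv_inflation_block(x, identity_spatial, w2d, w1d, w_proj)
print(out.shape)  # (4, 6, 6, 3)
```

Splitting the work into a per-frame 2D convolution and a per-pixel 1D convolution means only the temporal step mixes information across frames, which is what lets a block like this be added on top of layers trained on images alone.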