A detailed caption for the image could be:

"Attendees listen closely during a conference presentation on 'Lumiere: A space-time diffusion model for video generation'. The slide on display illustrates the model's architecture in three panels: (a) the Space-Time UNet (STUNet), (b) the convolution-based inflation block, and (c) attention-based inflation. A legend identifies spatial resizing, temporal resizing, skip connections, and the two kinds of inflation. Panel (b) also shows blocks for pretrained spatial layers, 2D and 1D convolutions, norm and activation, and linear projection. Together, the diagrams convey how Lumiere extends pretrained spatial layers to process video across both space and time."
Text transcribed from the image:
Lumiere
A space-time diffusion model for video generation
(a) Space-Time UNet (STUNet)
T × H × W × D
Legend:
Spatial Resizing
Temporal Resizing
Skip Connection
Conv-based Inflation
Attention-based Inflation
l et al., "Lumiere: A space-time diffusion model for video generation," arXiv 2024.
(b) Convolution-based Inflation Block
Pretrained Spatial Layer(s)
2D Convolution
Norm + activation
1D Convolution
Norm + activation
Linear Projection
(c) Attention-based Inflation
Pretrained Spati
IDA
Copyright
Linca
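
The legend entries for spatial resizing, temporal resizing, and skip connections can be illustrated with a small shape-tracking sketch. It follows a T × H × W × D activation through a U-shaped encoder and decoder. This is a minimal sketch, not the model from the slide: the function names `downsample` and `stunet_shapes`, the resize factor of 2, the number of levels, and the constant channel depth D are all assumptions made for illustration.

```python
from typing import List, Tuple

# (T, H, W, D): frames, height, width, channel depth
Shape = Tuple[int, int, int, int]


def downsample(shape: Shape, factor: int = 2) -> Shape:
    """Resize in time and space: shrink T, H and W, keep D unchanged.

    The factor of 2 and the fixed D are assumptions, not read from the slide.
    """
    t, h, w, d = shape
    return (max(1, t // factor), max(1, h // factor), max(1, w // factor), d)


def stunet_shapes(shape: Shape, levels: int = 2, factor: int = 2) -> List[Tuple[str, Shape]]:
    """Trace activation shapes along a hypothetical U-shaped path."""
    trace = [("input", shape)]
    skips = []
    # Encoder: remember each shape for its skip connection, then downsample.
    for i in range(levels):
        skips.append(shape)
        shape = downsample(shape, factor)
        trace.append((f"down{i + 1}", shape))
    # Decoder: upsample back to the shape saved by the matching encoder level.
    for i in range(levels):
        shape = skips.pop()
        trace.append((f"up{levels - i}", shape))
    return trace


if __name__ == "__main__":
    for name, s in stunet_shapes((16, 64, 64, 128)):
        print(name, s)
```

Tracking only shapes keeps the sketch independent of any deep learning library. It shows the one property the legend implies: the video is compressed along the time axis as well as the two spatial axes, and skip connections pair each decoder level with the encoder level of matching resolution.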