In this image, a person is viewing a PowerPoint presentation titled "Lumiere: A space-time diffusion model for video generation." The slide uses block diagrams to lay out the model's architecture in three panels: (a) the Space-Time UNet (STUNet), with a legend covering spatial resizing, temporal resizing, skip connections, and the two inflation types; (b) a convolution-based inflation block; and (c) attention-based inflation. The setting appears to be a conference or academic seminar.
Text transcribed from the image:
Lumiere
A space-time diffusion model for video generation
(a) Space-Time UNet (STUNet)
T × H × W × D
Legend:
Spatial Resizing
Temporal Resizing
Skip Connection
Conv-based Inflation
Attention-based Inflation
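The two resizing entries in the legend can be read as operations on the T × H × W × D tensor labeled in panel (a). The sketch below only tracks shapes. The factor of 2, the example input shape, and the function names `spatial_resize` and `temporal_resize` are illustrative assumptions; none of them are given in the transcribed slide.

```python
# Shape-only sketch of the legend's two resizing operations.
# Assumption: each downsampling step halves the affected axes (factor=2);
# the transcribed slide does not state the actual factors.

def spatial_resize(shape, factor=2):
    """Shrink H and W, leaving the number of frames T and channels D alone."""
    t, h, w, d = shape
    return (t, h // factor, w // factor, d)

def temporal_resize(shape, factor=2):
    """Shrink the number of frames T, leaving H, W and channels D alone."""
    t, h, w, d = shape
    return (t // factor, h, w, d)

# Hypothetical input: 16 frames of 64x64 features with 8 channels.
shape = (16, 64, 64, 8)
after_spatial = spatial_resize(shape)
after_both = temporal_resize(after_spatial)
print(shape, "->", after_spatial, "->", after_both)
```

Applying both operations compresses the video along space and time, which is what distinguishes a space-time UNet from one that resizes only spatially.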
l et al., "Lumiere: A space-time diffusion model for video generation," arXiv 2024.
(b) Convolution-based Inflation Block
Pretrained Spatial Layer(s)
2D Convolution
Norm + Activation
1D Convolution
Norm + Activation
Linear Projection
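Reading the labels of panel (b) in the order transcribed, the block factorizes a space-time convolution into a 2D convolution over each frame, followed by a 1D convolution along the time axis and a linear projection over channels. The NumPy code below is a minimal sketch of that data flow, not the paper's implementation. It makes several assumptions: one shared kernel for all channels, zero padding, odd kernel sizes, a channel-wise normalization with ReLU standing in for "Norm + Activation", and no modeling of the pretrained spatial layer(s). All function names are hypothetical.

```python
import numpy as np

def spatial_conv2d(video, kernel):
    """Apply one (kh, kw) kernel to every frame independently ('same' zero padding)."""
    T, H, W, D = video.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(video, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.zeros_like(video)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[:, i:i + H, j:j + W, :]
    return out

def temporal_conv1d(video, kernel):
    """Apply one length-k kernel along the time axis at every pixel ('same' zero padding)."""
    T = video.shape[0]
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(video, ((p, p), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(video)
    for i in range(k):
        out += kernel[i] * padded[i:i + T]
    return out

def norm_act(x, eps=1e-5):
    """Placeholder for 'Norm + Activation': normalize over channels, then ReLU."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return np.maximum((x - mean) / (std + eps), 0.0)

def conv_inflation_block(video, k2d, k1d, proj):
    """2D conv -> norm/act -> 1D temporal conv -> norm/act -> linear projection."""
    x = norm_act(spatial_conv2d(video, k2d))
    x = norm_act(temporal_conv1d(x, k1d))
    return x @ proj  # linear projection acts on the channel axis D

# Hypothetical input: 4 frames of 8x8 features with 3 channels.
rng = np.random.default_rng(0)
video = rng.random((4, 8, 8, 3))
out = conv_inflation_block(video, np.ones((3, 3)) / 9, np.ones(3) / 3, np.eye(3))
print(out.shape)
```

The spatial step never mixes frames and the temporal step never mixes pixels, so the block keeps the T × H × W × D shape while letting information flow along both space and time.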
(c) Attention-based Inflation
Pretrained Spati
1D A
Copyright
Linca