A presenter gives a talk in a conference room, showing a slide titled "Lumiere: A space-time diffusion model for video generation." The slide contains a detailed diagram of the model's components and workflow. Attendees are seated in front of a large projection screen, watching the presentation; one person in the foreground, wearing a black cap, is focused on the screen. The presenter stands at a podium to the right of the screen. The setting is professional, with organized seating and modern lighting.
Text transcribed from the image:
Lumiere
A space-time diffusion model for video generation
(a) Space-Time UNet (STUNet)
Legend:
Spatial Resizing
Temporal Resizing
Skip Connection
Conv-based Inflation
Attention-based Inflation
TxHxWxD
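The tensor-shape label and the spatial and temporal resizing entries in the legend suggest that panel (a) shrinks and then restores activations along both space and time. The following is a minimal sketch of that shape bookkeeping only. The function name `stunet_shapes`, the resizing factor of 2, the number of levels, and the fixed channel width D are illustrative assumptions, not values read from the slide.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int, int]  # (T, H, W, D): frames, height, width, channels


def stunet_shapes(shape: Shape, levels: int, factor: int = 2) -> List[Shape]:
    """Return activation shapes along a U-Net-style down/up path.

    Assumption: every level resizes T, H and W by the same factor and keeps
    the channel width D fixed (real models usually change D per level).
    """
    t, h, w, d = shape
    down = [(t, h, w, d)]
    for _ in range(levels):
        # Spatial resizing (H, W) and temporal resizing (T) in one step.
        t, h, w = t // factor, h // factor, w // factor
        down.append((t, h, w, d))
    # The decoder mirrors the encoder; skip connections join equal shapes.
    up = down[-2::-1]
    return down + up


shapes = stunet_shapes((16, 64, 64, 8), levels=2)
for s in shapes:
    print(s)
```

Under these assumptions the coarsest level holds a quarter of the frames at a quarter of the resolution, which is where processing the whole clip at once is cheapest.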
(b) Convolution-based Inflation Block
Pretrained Spatial Layer(s)
2D Convolution
Norm + activation
1D Convolution
Norm + activation
Linear Projection
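The layer sequence of panel (b) can be sketched as a factorized space-time convolution: a 2D convolution per frame, then a 1D convolution along time per pixel, then a per-pixel linear projection. This is a plain-Python sketch, not the authors' implementation. The depthwise kernels, the per-pixel normalization with ReLU, the residual connection, and the zero-initialized projection are all assumptions made for illustration, as are the function names.

```python
import math
from typing import List

Video = List[List[List[List[float]]]]  # indexed [t][h][w][d]


def zeros_like(x: Video) -> Video:
    return [[[[0.0 for _ in px] for px in row] for row in frame] for frame in x]


def conv2d_spatial(x: Video, k: List[List[float]]) -> Video:
    """Depthwise 3x3 filter applied to each frame independently (zero padding)."""
    T, H, W, D = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = zeros_like(x)
    for t in range(T):
        for i in range(H):
            for j in range(W):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:
                            for c in range(D):
                                out[t][i][j][c] += k[di + 1][dj + 1] * x[t][ii][jj][c]
    return out


def conv1d_temporal(x: Video, k: List[float]) -> Video:
    """Depthwise kernel-3 filter along the time axis at each pixel (zero padding)."""
    T = len(x)
    out = zeros_like(x)
    for t in range(T):
        for dt in (-1, 0, 1):
            tt = t + dt
            if 0 <= tt < T:
                for i, row in enumerate(x[tt]):
                    for j, px in enumerate(row):
                        for c, v in enumerate(px):
                            out[t][i][j][c] += k[dt + 1] * v
    return out


def norm_act(x: Video) -> Video:
    """Normalize each pixel's channel vector, then apply ReLU (assumed choices)."""
    def f(px: List[float]) -> List[float]:
        m = sum(px) / len(px)
        var = sum((v - m) ** 2 for v in px) / len(px)
        s = math.sqrt(var + 1e-5)
        return [max(0.0, (v - m) / s) for v in px]
    return [[[f(px) for px in row] for row in frame] for frame in x]


def linear(x: Video, w: List[List[float]]) -> Video:
    """Per-pixel projection over channels: out[c] = sum_k w[c][k] * px[k]."""
    return [[[[sum(wc[k] * px[k] for k in range(len(px))) for wc in w]
              for px in row] for row in frame] for frame in x]


def conv_inflation_block(x: Video, k2d, k1d, w_proj) -> Video:
    h = norm_act(conv2d_spatial(x, k2d))   # 2D Convolution, Norm + activation
    h = norm_act(conv1d_temporal(h, k1d))  # 1D Convolution, Norm + activation
    h = linear(h, w_proj)                  # Linear Projection
    # Assumed residual connection around the whole block.
    return [[[[a + b for a, b in zip(px, hp)] for px, hp in zip(row, hrow)]
             for row, hrow in zip(frame, hframe)] for frame, hframe in zip(x, h)]


# Toy input standing in for the output of the pretrained spatial layer(s).
T, H, W, D = 4, 3, 3, 2
video = [[[[float(t + i + j + c) for c in range(D)] for j in range(W)]
          for i in range(H)] for t in range(T)]
k2d = [[1.0 / 9.0] * 3 for _ in range(3)]
k1d = [0.25, 0.5, 0.25]
zero_proj = [[0.0] * D for _ in range(D)]
out_identity = conv_inflation_block(video, k2d, k1d, zero_proj)
```

With the projection initialized to zero, the block returns its input unchanged, so the pretrained spatial layers would behave exactly as before at the start of training. That initialization is a common practice for adding layers to a pretrained network and is assumed here rather than taken from the slide.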
(c) Attention-based Inflation Block
Pretrained Spatial Layer(s)
1D Attention (x)
Linear Projection
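The attention-based block can be read as self-attention along the time axis: each pixel location contributes a sequence of T feature vectors, and attention mixes information across frames before a linear projection. The sketch below is a plain-Python illustration under stated assumptions: a single head, identity query/key/value maps, scaled dot-product scores, and a residual connection. None of these details, nor the function names, are confirmed by the slide.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]


def temporal_attention(seq: List[List[float]]) -> List[List[float]]:
    """Self-attention over the T time steps of one pixel; seq is indexed [t][d].

    Assumption: identity query/key/value maps and a single head.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq])
        out.append([sum(w[t] * seq[t][c] for t in range(len(seq))) for c in range(d)])
    return out


def attention_inflation_block(video, w_proj):
    """video is indexed [t][h][w][d]; w_proj is a D x D matrix."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = [[[None] * W for _ in range(H)] for _ in range(T)]
    for i in range(H):
        for j in range(W):
            seq = [video[t][i][j] for t in range(T)]  # 1D sequence along time
            att = temporal_attention(seq)             # 1D Attention
            for t in range(T):
                # Linear Projection, then an assumed residual connection.
                proj = [sum(wc[k] * att[t][k] for k in range(len(wc))) for wc in w_proj]
                out[t][i][j] = [a + b for a, b in zip(video[t][i][j], proj)]
    return out


# Toy input: 3 frames of a 2x2 image with 2 channels, all frames identical.
frame = [[[1.0, 2.0], [0.5, -1.0]], [[0.0, 3.0], [2.0, 2.0]]]
static_video = [[[list(px) for px in row] for row in frame] for _ in range(3)]
identity = [[1.0, 0.0], [0.0, 1.0]]
out_static = attention_inflation_block(static_video, identity)
```

Because every time step attends to every other one, the cost of this layer grows quadratically with the number of frames, which is the usual reason for applying temporal attention only where the activations are small.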
Bar-Tal et al., "Lumiere: A space-time diffusion model for video generation," arXiv 2024.
Copyright Mike Shou, NUS