In this image, a presenter is delivering a talk on "NUWA-XL: Recursive Interpolations for Generating Very Long Videos." The slide on the projection screen describes Mask Temporal Diffusion (MTD), which serves as the basic diffusion model for both the global and local diffusion models. A flowchart shows the MTD architecture, with arrows color-coded for the diffusion process, visual conditions, prompts, and timesteps. The diagram includes DownBlock, UpBlock, and MidBlock stages, along with "T-KLVAE Enc" and "T-KLVAE Dec" components. A note on masking visual conditions states that global diffusion masks all frames, while local diffusion masks only the middle frames. A few attendees watch the presentation from their seats. The presenter stands to the right of the screen, explaining the model. The conference room is well-lit, with a plain beige background.
Text transcribed from the image:
NUWA-XL
Recursive interpolations for generating very long videos
Mask Temporal Diffusion (MTD)
• A basic diffusion model for global & local diffusion models
ε ∈ N(0,1)
L Prompts
V
CLIP text Enc
Timestep ~ U(1, T)
V
Time Enc
d.
d
mask
middle frames
W%
T-KLVAE Enc
T-KLVAE Enc
MSE Loss
A
ε_θ(x_t)
DownBlock
Itout
UpBlock
SAN
DownBlock
Conv Down
UpBlock
Masking visual conditions
DownBlock
Conv Down
UpBlock
SA
Global diffusion: mask all
Local diffusion: mask middle frames
DownBlock
Conv Down
UpBlock
MidBlock
A
P
→ Diffusion Process
→ Visual Condition
→ Prompts
→ Timesteps
Yin et al., "NUWA-XL: Diffusion over Diffusion for extremely Long Video Generation," arXiv 2023.
Copyright Mike Shou, NUS
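The masking rule stated on the slide (global diffusion: mask all; local diffusion: mask middle frames) can be illustrated with a minimal Python sketch. The function name, the boolean-list representation, and the reading of "middle frames" as every frame except the first and last are assumptions made for illustration only; the slide does not specify them.

```python
def visual_condition_mask(num_frames, mode):
    """Return a list of booleans, one per frame; True means the frame's
    visual condition is masked (hidden from the model).

    Hypothetical helper: the name and representation are illustrative,
    not taken from the slide or the NUWA-XL code.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if mode == "global":
        # Global diffusion: mask all visual conditions.
        return [True] * num_frames
    if mode == "local":
        # Local diffusion: mask the middle frames.
        # Assumption: "middle" means everything except the first and last frame.
        return [0 < i < num_frames - 1 for i in range(num_frames)]
    raise ValueError("mode must be 'global' or 'local'")


global_mask = visual_condition_mask(5, "global")
local_mask = visual_condition_mask(5, "local")
print(global_mask)
print(local_mask)
```

Under these assumptions, the local mask leaves the two end frames visible as conditions, while the global mask hides every frame.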