A presenter at a conference is showing a slide titled "NUWA-XL" that covers "Mask Temporal Diffusion (MTD)." The slide describes MTD as a basic diffusion model for both the global and local diffusion models, illustrated with diagrams and text blocks. The audience, partially visible from behind, is watching the presentation. The presenter stands on the right, next to a laptop and a podium. The setting appears to be a conference room with a large projection screen displaying detailed technical content. Judging by the slide's text and diagrams, the presenter is explaining how diffusion models are used for video generation.
Text transcribed from the image:
NUWA-XL
Recursive interpolations for generating very long videos
Mask Temporal Diffusion (MTD)
• A basic diffusion model for global & local diffusion models
[Diagram: Mask Temporal Diffusion architecture]
Inputs and encoders: noise ε ∈ N(0, I); Prompts, CLIP text Enc; Timestep ~ U(1, T), Time Enc; mask, middle frames, T-KLVAE Enc (×2)
Network blocks: DownBlock (×4), Conv Down (×3), MidBlock, UpBlock (×4)
Output and training: ε_θ(x_t), MSE Loss
Masking visual conditions
• Global diffusion: mask all
• Local diffusion: mask middle frames
Legend (arrow types): Diffusion Process, Visual Condition, Prompts, Timesteps
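The "Masking visual conditions" rule on the slide (global diffusion: mask all; local diffusion: mask middle frames) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `mask_visual_condition`, the zero fill value, the use of one float per frame as a stand-in for a frame latent, and the choice to keep exactly the first and last frames in local mode are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mask_visual_condition(
    frames: Sequence[float],
    mode: str,
    fill: float = 0.0,
) -> Tuple[List[float], List[int]]:
    """Return (masked_frames, keep_mask) for one clip.

    keep_mask[i] is 1 where frame i is kept as a visual condition and
    0 where it is masked out and has to be generated.
    """
    n = len(frames)
    if mode == "global":
        # Global diffusion: mask all frames, so no visual condition is given.
        keep_mask = [0] * n
    elif mode == "local":
        # Local diffusion: mask the middle frames.
        # Assumption: only the first and last frames stay visible.
        if n < 2:
            raise ValueError("local mode needs at least two frames")
        keep_mask = [1] + [0] * (n - 2) + [1]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Replace every masked frame with the fill value (hypothetical choice).
    masked = [f if k else fill for f, k in zip(frames, keep_mask)]
    return masked, keep_mask


clip = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical stand-in for per-frame latents
print(mask_visual_condition(clip, "global"))
print(mask_visual_condition(clip, "local"))
```

Under these assumptions one function serves both cases, which matches the slide's description of MTD as a shared basic model: only the mask differs between the global and local settings.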
Yin et al., "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," arXiv 2023.
Copyright Mike Shou, NUS