A man presents at a conference or seminar on the topic "NUWA-XL: Recursive interpolations for generating very long videos." The slide on the screen describes "Mask Temporal Diffusion (MTD)," a basic diffusion model that underlies both the global and the local diffusion models. Its architecture diagram shows components such as "CLIP text Enc," "Time Enc," "T-KLVAE Enc," and a stack of DownBlock, Conv Down, MidBlock, and UpBlock layers trained with an MSE loss. The slide also covers "Masking visual conditions": global diffusion masks all frames, while local diffusion masks only the middle frames. Attendees in the audience listen attentively. The technical equipment and formal seating arrangement suggest a professional setting.

Text transcribed from the image: NUWA-XL: Recursive interpolations for generating very long videos. Mask Temporal Diffusion (MTD): a basic diffusion model for global & local diffusion models. Diagram labels: ε ~ N(0, 1); Prompts; CLIP text Enc; Timestep ~ U(1, T); Time Enc; mask middle frames; T-KLVAE Enc; MSE Loss; ε_θ(x_t); DownBlock; Conv Down; MidBlock; UpBlock; SA. Masking visual conditions: Global diffusion: mask all. Local diffusion: mask middle frames. Legend: Diffusion Process, Visual Condition, Prompts, Timesteps. Yin et al., "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation," arXiv 2023. Copyright Mike Shou, NUS.
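
The masking rule named on the slide (global diffusion masks all frames, local diffusion masks the middle frames) can be illustrated with a small sketch. This is not the authors' code: the function name `visual_condition_mask`, the 1 = masked / 0 = kept convention, and the assumption that local diffusion keeps exactly the first and last frame as visual conditions are all hypothetical choices made for illustration.

```python
from typing import List


def visual_condition_mask(num_frames: int, mode: str) -> List[int]:
    """Build a per-frame mask for the visual condition.

    1 means the frame is masked (hidden from the model),
    0 means the frame is kept as a visual condition.
    Hypothetical helper; the convention is an assumption, not taken from the slide.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if mode == "global":
        # Global diffusion: mask all frames, so no visual condition is given.
        return [1] * num_frames
    if mode == "local":
        # Local diffusion: mask the middle frames only.
        # Assumption: the first and last frames are the ones kept.
        return [0] + [1] * (num_frames - 2) + [0]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(visual_condition_mask(5, "global"))
    print(visual_condition_mask(5, "local"))
```

Under these assumptions, the same network can serve both roles: only the mask changes between the global and the local case.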