This image captures a presentation slide titled "Text2Video-Zero," which explains how Stable Diffusion can be used to generate videos without any finetuning. The text on the slide outlines the process:

- **Start from noises of similar pattern**: given the first frame's noise, define a global scene motion; this motion is then used to translate the first frame's noise and generate correlated initial noise for the other frames.

The slide includes a graphical illustration of the process: a flowchart with mathematical notation and neural network components. It shows how a text prompt, such as "A horse is galloping on the street," is turned into video frames using diffusion sampling steps (DDIM backward and DDPM forward) and convolution and linear-projection layers, refined by cross-frame attention inside a transformer block. A salient object detector identifies the key element (in this case, a horse) in the generated frames. The research is credited to Khachatryan et al., as cited from an arXiv paper. The presentation is authored by Mike Shou of NUS, indicated at the bottom right corner of the slide. The audience, partially visible at the bottom of the image, attentively views the slide in what appears to be an academic or research setting.

Text transcribed from the image:

Text2Video-Zero

Use Stable Diffusion to generate videos without any finetuning

- Start from noises of similar pattern: given the first frame's noise, define a global scene motion, used to translate the first frame's noise to generate similar initial noise for other frames

$x_T^1 \sim \mathcal{N}(0, 1)$; $x_{T'}^1 = \text{DDIM\_Backward}(x_T^1, \Delta t, \text{SD})$; $\hat{x}_{T'}^k = W_k(x_{T'}^1)$; $x_T^k = \text{DDPM\_Forward}(\hat{x}_{T'}^k, \Delta t)$ for $k = 2, 3, \ldots, m$

Text prompt: "A horse is galloping on the street"

[Architecture diagram: Convolution → Linear Projections (×3) → Cross-Frame Attention (Softmax) → Cross-Attention → FFN, inside a Transformer Block (×2), followed by a Salient Object Detector on the generated frames.]

Khachatryan et al., "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators," arXiv 2023.

Copyright Mike Shou, NUS
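The noise-initialization recipe in the transcription (DDIM backward on the first frame's noise, warp by the global motion, DDPM forward back to the original noise level) can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: `ddim_backward` and `ddpm_forward` are hypothetical callables standing in for a Stable Diffusion scheduler's partial denoising and re-noising steps, and the step count, motion `delta`, and latent shape are made up for the example.

```python
import torch
import torch.nn.functional as F

def warp(latent, dx, dy):
    """Translate a latent tensor by (dx, dy) pixels -- the global scene motion W_k."""
    n, c, h, w = latent.shape
    # Affine grid in normalized [-1, 1] coords; shifting the sampling grid by
    # -2*dx/w moves the content right by dx pixels.
    theta = torch.tensor([[1.0, 0.0, -2.0 * dx / w],
                          [0.0, 1.0, -2.0 * dy / h]]).unsqueeze(0).repeat(n, 1, 1)
    grid = F.affine_grid(theta, latent.shape, align_corners=False)
    return F.grid_sample(latent, grid, padding_mode="reflection", align_corners=False)

def init_latents(ddim_backward, ddpm_forward, m=8, delta=(4.0, 4.0),
                 shape=(1, 4, 64, 64), delta_t=10):
    """Generate m correlated initial latents from one Gaussian sample.

    ddim_backward / ddpm_forward are assumed callables wrapping the SD
    samplers; they take a latent and a number of steps (delta_t).
    """
    x1_T = torch.randn(shape)                # first frame's noise: x_T^1 ~ N(0, 1)
    x1_Tp = ddim_backward(x1_T, delta_t)     # partially denoise with SD: x_{T'}^1
    latents = [x1_T]
    for k in range(2, m + 1):
        dx, dy = (k - 1) * delta[0], (k - 1) * delta[1]  # motion grows with frame index
        xk_Tp = warp(x1_Tp, dx, dy)          # \hat{x}_{T'}^k = W_k(x_{T'}^1)
        latents.append(ddpm_forward(xk_Tp, delta_t))     # re-noise back to level T
    return torch.stack(latents, dim=0)
```

The point of the round trip is that warping raw Gaussian noise would destroy its statistics, whereas warping a partially denoised latent and then re-noising it yields valid initial latents that already share a coherent global motion.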
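The cross-frame attention in the transformer block can likewise be sketched. In this minimal PyTorch sketch (class name and dimensions are my own, not from the slide), every frame's queries attend to keys and values computed from the first frame's features, which is how the pretrained text-to-image model keeps appearance consistent across frames without any finetuning.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Attention where K and V come from the first frame's features.

    A sketch of the slide's "Cross-Frame Attention": the three linear
    projections are the usual Q/K/V layers, but K and V see only frame 1.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)   # "Linear Projection" x3 on the slide
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (frames, tokens, dim) -- latent features for all m frames
        m, n, d = x.shape
        q = self.to_q(x)                              # queries from every frame
        first = x[:1].expand(m, n, d)                 # broadcast frame 1's features
        k, v = self.to_k(first), self.to_v(first)     # keys/values from frame 1 only

        def split(t):  # (m, n, d) -> (m, heads, n, d_head)
            return t.reshape(m, n, self.heads, d // self.heads).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / (d // self.heads) ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(m, n, d)
        return self.to_out(out)
```

Because the Q/K/V projections are just the pretrained linear layers applied to different inputs, this swap reuses Stable Diffusion's weights unchanged, which is what makes the method zero-shot.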