This image is taken from a presentation slide discussing "Text2Video-Zero," a framework that uses Stable Diffusion to generate videos from text prompts without any finetuning. The slide outlines the initialization scheme: starting from noises of a similar pattern, a global scene motion is defined and used to translate the first frame's noise, producing correlated initial noise for the other frames so that the scene motion stays globally consistent. The diagram on the slide illustrates this process, including the DDIM Backward and DDPM Forward operations, and shows how the text prompt "A horse is galloping on the street" drives the generation. A salient object detector is also depicted, indicating its role in isolating the foreground subject (the horse) in the generated video frames. The slide cites a paper by Khachatryan et al. and is credited to Mike Shou of NUS.

Text transcribed from the image:

Text2Video-Zero
Use Stable Diffusion to generate videos without any finetuning
• Start from noises of similar pattern: given the first frame's noise, define a global scene motion, used to translate the first frame's noise to generate similar initial noise for other frames

$x_T^1 \sim \mathcal{N}(0, 1)$
$x_{T'}^1 = \text{DDIM\_Backward}(x_T^1, \Delta t, \text{SD})$
$\hat{x}_{T'}^k = W_k(x_{T'}^1)$
$x_T^k = \text{DDPM\_Forward}(\hat{x}_{T'}^k, \Delta t)$ for $k = 2, 3, \ldots, m$

Text prompt: "A horse is galloping on the street"

[Diagram labels: Convolution; Linear Projection (×3); Cross-Frame Attention; Softmax; Cross-Attention; FFN; Transformer Block ×2; $x_T$; Salient Object Detector]

Khachatryan et al., "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators," arXiv 2023.
Copyright Mike Shou, NUS
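
The initialization scheme the slide describes can be sketched in code. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: `ddim_backward` and `ddpm_forward` are hypothetical stand-ins for running $\Delta t$ deterministic (DDIM) and stochastic (DDPM) diffusion steps with the Stable Diffusion model, and the warp $W_k$ is approximated by a simple integer-pixel translation.

```python
import torch

def warp_translate(latent: torch.Tensor, dx: int, dy: int) -> torch.Tensor:
    """Apply a global translation to a latent map (wrap-around shift).

    Stand-in for the warping W_k on the slide; a real implementation
    might use grid_sample with sub-pixel motion instead.
    """
    return torch.roll(latent, shifts=(dy, dx), dims=(-2, -1))

def init_frame_latents(x_T1, ddim_backward, ddpm_forward, delta_t,
                       num_frames, step=(4, 0)):
    """Build correlated initial noise for all frames from the first frame's noise.

    x_T1          : first frame's Gaussian noise, shape (1, C, H, W)
    ddim_backward : hypothetical helper running delta_t deterministic DDIM
                    steps with the Stable Diffusion model (x_T -> x_T')
    ddpm_forward  : hypothetical helper re-noising a latent by delta_t
                    stochastic DDPM steps (x_T' -> x_T)
    step          : assumed per-frame global motion (dx, dy) in latent pixels
    """
    # x_{T'}^1 = DDIM_Backward(x_T^1, Δt, SD): partially denoise frame 1's noise
    x_Tp1 = ddim_backward(x_T1, delta_t)

    latents = [x_T1]
    for k in range(2, num_frames + 1):
        # x̂_{T'}^k = W_k(x_{T'}^1): translate by the accumulated global motion
        dx, dy = step[0] * (k - 1), step[1] * (k - 1)
        x_hat = warp_translate(x_Tp1, dx, dy)
        # x_T^k = DDPM_Forward(x̂_{T'}^k, Δt): re-noise back to level T
        latents.append(ddpm_forward(x_hat, delta_t))
    return torch.cat(latents, dim=0)  # (m, C, H, W) correlated initial noise
```

Because every frame's starting noise is a warped copy of the same partially denoised latent, the frames share coarse structure, which is what keeps the global scene motion consistent without any training.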
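
The Cross-Frame Attention block in the diagram is the other key modification: in Text2Video-Zero, the self-attention layers of the Stable Diffusion U-Net are modified so that every frame's queries attend to the keys and values of the first frame, anchoring object appearance across frames. Below is a minimal single-head sketch; shapes and names are illustrative assumptions, not the library's API.

```python
import torch

def cross_frame_attention(q, k, v):
    """Self-attention variant in which every frame's queries attend to the
    FIRST frame's keys and values, keeping appearance consistent.

    q, k, v: (frames, tokens, dim) single-head projections, for simplicity.
    """
    frames, tokens, dim = q.shape
    # Broadcast the first frame's keys/values to all frames
    k1 = k[:1].expand(frames, -1, -1)
    v1 = v[:1].expand(frames, -1, -1)
    attn = torch.softmax(q @ k1.transpose(-2, -1) / dim ** 0.5, dim=-1)
    return attn @ v1

# Toy usage: 8 frames, 64 latent tokens, 32-dim head
q = torch.randn(8, 64, 32); k = torch.randn(8, 64, 32); v = torch.randn(8, 64, 32)
out = cross_frame_attention(q, k, v)  # (8, 64, 32)
```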