This image is taken from a presentation slide discussing "Text2Video-Zero," a framework that uses Stable Diffusion to generate videos from text prompts without any finetuning. The slide outlines the initialization scheme: starting from noises of a similar pattern, a global scene motion is defined and used to translate the first frame's noise, producing correlated initial noise for the other frames so that the scene motion stays globally consistent. The diagram on the slide illustrates this process, including the DDIM Backward and DDPM Forward operations, and shows how the text prompt "A horse is galloping on the street" drives the generation. A salient object detector is also depicted, indicating its role in isolating the foreground subject (the horse) in the generated video frames. The slide cites a paper by Khachatryan et al. and is credited to Mike Shou of NUS.

Text transcribed from the image:

Text2Video-Zero
Use Stable Diffusion to generate videos without any finetuning
• Start from noises of similar pattern: given the first frame's noise, define a global scene motion, used to translate the first frame's noise to generate similar initial noise for other frames

$x_T^1 \sim \mathcal{N}(0, 1)$
$x_{T'}^1 = \text{DDIM\_Backward}(x_T^1, \Delta t, \text{SD})$
$\hat{x}_{T'}^k = W_k(x_{T'}^1)$
$x_T^k = \text{DDPM\_Forward}(\hat{x}_{T'}^k, \Delta t)$ for $k = 2, 3, \ldots, m$

Text prompt: "A horse is galloping on the street"

[Diagram labels: Convolution; Linear Projection (×3); Cross-Frame Attention; Softmax; Cross-Attention; FFN; Transformer Block ×2; $x_T$; Salient Object Detector]

Khachatryan et al., "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators," arXiv 2023.
Copyright Mike Shou, NUS
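
The initialization scheme the slide describes can be sketched in code. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: `ddim_backward` and `ddpm_forward` are hypothetical stand-ins for running $\Delta t$ deterministic (DDIM) and stochastic (DDPM) diffusion steps with the Stable Diffusion model, and the warp $W_k$ is approximated by a simple integer-pixel translation.

```python
import torch

def warp_translate(latent: torch.Tensor, dx: int, dy: int) -> torch.Tensor:
    """Apply a global translation to a latent map (wrap-around shift).

    Stand-in for the warping W_k on the slide; a real implementation
    might use grid_sample with sub-pixel motion instead.
    """
    return torch.roll(latent, shifts=(dy, dx), dims=(-2, -1))

def init_frame_latents(x_T1, ddim_backward, ddpm_forward, delta_t,
                       num_frames, step=(4, 0)):
    """Build correlated initial noise for all frames from the first frame's noise.

    x_T1          : first frame's Gaussian noise, shape (1, C, H, W)
    ddim_backward : hypothetical helper running delta_t deterministic DDIM
                    steps with the Stable Diffusion model (x_T -> x_T')
    ddpm_forward  : hypothetical helper re-noising a latent by delta_t
                    stochastic DDPM steps (x_T' -> x_T)
    step          : assumed per-frame global motion (dx, dy) in latent pixels
    """
    # x_{T'}^1 = DDIM_Backward(x_T^1, Δt, SD): partially denoise frame 1's noise
    x_Tp1 = ddim_backward(x_T1, delta_t)

    latents = [x_T1]
    for k in range(2, num_frames + 1):
        # x̂_{T'}^k = W_k(x_{T'}^1): translate by the accumulated global motion
        dx, dy = step[0] * (k - 1), step[1] * (k - 1)
        x_hat = warp_translate(x_Tp1, dx, dy)
        # x_T^k = DDPM_Forward(x̂_{T'}^k, Δt): re-noise back to level T
        latents.append(ddpm_forward(x_hat, delta_t))
    return torch.cat(latents, dim=0)  # (m, C, H, W) correlated initial noise
```

Because every frame's starting noise is a warped copy of the same partially denoised latent, the frames share coarse structure, which is what keeps the global scene motion consistent without any training.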
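
The Cross-Frame Attention block in the diagram is the other key modification: in Text2Video-Zero, the self-attention layers of the Stable Diffusion U-Net are modified so that every frame's queries attend to the keys and values of the first frame, anchoring object appearance across frames. Below is a minimal single-head sketch; shapes and names are illustrative assumptions, not the library's API.

```python
import torch

def cross_frame_attention(q, k, v):
    """Self-attention variant in which every frame's queries attend to the
    FIRST frame's keys and values, keeping appearance consistent.

    q, k, v: (frames, tokens, dim) single-head projections, for simplicity.
    """
    frames, tokens, dim = q.shape
    # Broadcast the first frame's keys/values to all frames
    k1 = k[:1].expand(frames, -1, -1)
    v1 = v[:1].expand(frames, -1, -1)
    attn = torch.softmax(q @ k1.transpose(-2, -1) / dim ** 0.5, dim=-1)
    return attn @ v1

# Toy usage: 8 frames, 64 latent tokens, 32-dim head
q = torch.randn(8, 64, 32); k = torch.randn(8, 64, 32); v = torch.randn(8, 64, 32)
out = cross_frame_attention(q, k, v)  # (8, 64, 32)
```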