A presentation slide titled "Text2Video-Zero" describes using Stable Diffusion to generate videos without any additional fine-tuning. The slide explains starting from noise with a similar pattern across frames: given the first frame's noise, a global scene motion is defined and used to translate that noise into similar initial noise for the other frames. A mathematical process and a diagram illustrate the concept, using the example text prompt "A horse is galloping on the street." The slide cites Khachatryan et al., "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators," and is attributed to Mike Shou of the National University of Singapore (NUS). The backs of a few attendees' heads are visible in the foreground.

Text transcribed from the image:

Text2Video-Zero

Use Stable Diffusion to generate videos without any fine-tuning

• Start from noises of a similar pattern: given the first frame's noise, define a global scene motion, used to translate the first frame's noise to generate similar initial noise for other frames

$x_T^1 \sim \mathcal{N}(0, I)$
$x_{T'}^1 = \mathrm{DDIM\_Backward}(x_T^1, \Delta t, \mathrm{SD})$
$\tilde{x}_{T'}^k = W_k(x_{T'}^1)$, $x_T^k = \mathrm{DDPM\_Forward}(\tilde{x}_{T'}^k, \Delta t)$ for $k = 2, 3, \ldots, m$

Text prompt: "A horse is galloping on the street"

[Diagram: a Stable Diffusion transformer block (×2) — Convolution, Linear Projections for queries/keys/values, Cross-Frame Attention with Softmax, Cross-Attention, FFN — alongside a Salient Object Detector.]

Khachatryan et al., "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators," arXiv 2023.

Copyright Mike Shou, NUS
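The latent-initialization steps transcribed above can be made concrete with a short PyTorch sketch. This is a minimal illustration, not the authors' code: the helper names `warp_latent` and `init_video_latents` are mine, and the DDIM-backward and DDPM-forward steps, which require a full Stable Diffusion scheduler, are stubbed as identity functions so the sketch stays self-contained.

```python
import torch
import torch.nn.functional as F

def warp_latent(x, dx, dy):
    """Translate a latent tensor by (dx, dy) latent-space pixels.

    Stand-in for the warping operator W_k on the slide; assumes a pure
    global translation with zero padding outside the original frame.
    x: (C, H, W) float tensor.
    """
    c, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h),
        torch.linspace(-1.0, 1.0, w),
        indexing="ij",
    )
    # Shift the sampling grid opposite to the motion so content moves by (dx, dy).
    grid = torch.stack(
        (xs - 2.0 * dx / (w - 1), ys - 2.0 * dy / (h - 1)), dim=-1
    )
    return F.grid_sample(
        x.unsqueeze(0), grid.unsqueeze(0),
        mode="bilinear", padding_mode="zeros", align_corners=True,
    ).squeeze(0)

def init_video_latents(num_frames, shape, delta=(4.0, 0.0), lam=1.0):
    """Build correlated initial noise x_T^1 .. x_T^m for all frames.

    DDIM_Backward and DDPM_Forward are identity stubs here; in the real
    method they partially denoise the first frame's noise for Δt steps
    before warping, then re-noise the warped latent afterwards.
    """
    ddim_backward = lambda x: x  # stub for DDIM_Backward(x_T^1, Δt, SD)
    ddpm_forward = lambda x: x   # stub for DDPM_Forward(x̃_{T'}^k, Δt)

    x1 = torch.randn(shape)       # x_T^1 ~ N(0, I)
    base = ddim_backward(x1)      # x_{T'}^1
    latents = [x1]
    for k in range(2, num_frames + 1):
        # Global scene motion: frame k's noise is translated by λ·(k−1)·δ.
        dx, dy = lam * (k - 1) * delta[0], lam * (k - 1) * delta[1]
        warped = warp_latent(base, dx, dy)    # x̃_{T'}^k = W_k(x_{T'}^1)
        latents.append(ddpm_forward(warped))  # x_T^k
    return torch.stack(latents)

latents = init_video_latents(num_frames=8, shape=(4, 64, 64))
print(latents.shape)  # torch.Size([8, 4, 64, 64])
```

Because every frame's initial noise is a translated copy of the (partially denoised) first-frame noise, the frames start from correlated latents and the generated video inherits the defined global motion.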
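The cross-frame attention box in the diagram can be sketched the same way: the block keeps the first frame's keys and values and lets every frame's queries attend to them, which ties the frames' appearance together without any fine-tuning. A minimal single-head sketch (the function name is illustrative):

```python
import torch

def cross_frame_attention(q, k, v):
    """Single-head cross-frame attention sketch.

    q, k, v: (frames, tokens, dim). Unlike per-frame self-attention,
    all frames attend to the FIRST frame's keys/values, so each output
    frame is rendered in terms of the first frame's content.
    """
    m, n, d = q.shape
    k1 = k[0].expand(m, n, d)  # broadcast first-frame keys to every frame
    v1 = v[0].expand(m, n, d)  # broadcast first-frame values
    attn = torch.softmax(q @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v1

# Example: 8 frames, 64 latent tokens, 40-dim attention head.
q, k, v = (torch.randn(8, 64, 40) for _ in range(3))
print(cross_frame_attention(q, k, v).shape)  # torch.Size([8, 64, 40])
```

In the architecture shown on the slide, this substitution replaces the self-attention inside each transformer block, while the cross-attention (text conditioning) and FFN sublayers are left unchanged.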