A detailed scientific research poster titled "MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation" by researchers Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo, and Baining Guo from Microsoft Research. The poster is presented at CVPR (the Conference on Computer Vision and Pattern Recognition) in Seattle, WA, June 17-21, 2024.

**Main Section Breakdown:**

1. **Overview:**
   - **Motivation:** Introduces the aim of improving text-to-video generation by splitting the task into text-to-image generation followed by image&text-to-video generation.
   - **Image&Text-to-Video Architecture:** A detailed diagram of the proposed diffusion-based image&text-to-video architecture.
   
2. **Method:**
   - Describes the Appearance Injection Network (AppearNet) for conditioning on the reference frame, and the Appearance Noise Prior used during denoising.
   - Diagram illustrating the AppearNet and Appearance Noise Prior integration.

3. **Experiments:**
   - **Quantitative Results:** Tables showing performance metrics and evaluations demonstrating MicroCinema’s ability to generate high-quality videos.
   - **Human Evaluation:** Graphs displaying human evaluation metrics for generated videos.
   - **Ablations:** Presents ablation studies of the model's components with visual and quantitative results.
   - **Qualitative Results:** Video frames comparing variants with and without AIN/ANP, plus comparisons against other methods.

The poster is located at booth number 350 and features comprehensive content including visual results and graphs to highlight the efficacy of the proposed approach in text-to-video generation.
Text transcribed from the image:
Highlight
Microsoft Research
350
Microsoft Research Asia
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo, Baining Guo

1. Overview
Motivation:
➤ MicroCinema aims to improve text-to-video generation by dividing the process into two stages: leveraging advanced text-to-image models for generating high-quality images, and then focusing on motion dynamics for video generation.
[Figure: MicroCinema pipeline. Text prompt ("A corgi is running on the grass.") → text-to-image (T2I) model (e.g., DALL-E, Midjourney) → image&text-to-video (TI2V) model → temporal interpolation.]
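To make the two-stage split concrete, the sketch below outlines the flow; the helper names (`text_to_image`, `image_text_to_video`, `interpolate_frames`) are placeholders and not an API from the poster.

```python
def text_to_image(prompt: str):
    """Stage 1: any strong off-the-shelf T2I model (e.g., DALL-E or Midjourney class)."""
    raise NotImplementedError  # stands in for an external model call

def image_text_to_video(image, prompt: str, num_frames: int = 16):
    """Stage 2: the image&text-to-video (TI2V) diffusion model described on the poster."""
    raise NotImplementedError

def interpolate_frames(frames, factor: int = 4):
    """Final step: temporal interpolation to a higher frame rate."""
    raise NotImplementedError

def generate_video(prompt: str):
    key_frame = text_to_image(prompt)              # appearance comes from the T2I stage
    clip = image_text_to_video(key_frame, prompt)  # motion comes from the TI2V stage
    return interpolate_frames(clip)                # densify the short clip in time
```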
Image&Text-to-Video Architecture:
[Figure: diffusion-based image&text-to-video architecture; legible labels include VAE, 3D Encoder, Downsample, Middle Block, 3D Decoder, AppearNet, Appearance, Embedding, Text Prompt, and Reconstruction.]
Overall architecture of our proposed diffusion-based image&text-to-video model in MicroCinema. The proposed AppearNet provides appearance information for video generation. We also introduce an effective and distinctive Appearance Noise Prior tailored for fine-tuning text-to-image diffusion models.
CVPR, Seattle, WA, June 17-21, 2024
2. Method
➤ To enhance the model's capability in handling the reference center frame, we introduce the Appearance Injection Network (AppearNet):
✓ ControlNet
✓ Add-to-EncDec
✓ SPADE Group Norm: γ_a(1 + ĥ) + β_a

[Figure: AppearNet design. Paired blocks from the main branch (i-th 3D ResBlock / 3D attention block) and AppearNet (j-th 3D ResBlock / 3D attention block), built from spatial layers, temporal layers, 3D GroupNorm (GN), SiLU, and 3×1×1 convolutions; appearance embeddings (Appear Emb) from AppearNet are injected into the main branch.]
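Below is a minimal PyTorch sketch of the SPADE-style group-norm injection, following the modulation form transcribed above; the module name, shapes, and per-frame (2D) treatment are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AppearanceSPADEGroupNorm(nn.Module):
    """Group-normalize the main-branch feature, then modulate it with a scale and
    shift predicted from the appearance feature: γ_a·(1 + ĥ) + β_a."""

    def __init__(self, channels: int, appear_channels: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        # 1x1 convs map the appearance feature to per-location γ_a and β_a
        self.to_gamma = nn.Conv2d(appear_channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(appear_channels, channels, kernel_size=1)

    def forward(self, h: torch.Tensor, appear: torch.Tensor) -> torch.Tensor:
        h_hat = self.norm(h)            # ĥ: group-normalized main-branch feature
        gamma = self.to_gamma(appear)   # γ_a predicted from the AppearNet feature
        beta = self.to_beta(appear)     # β_a predicted from the AppearNet feature
        return gamma * (1 + h_hat) + beta

# Toy shapes: (B, C, H, W) main feature, (B, C_a, H, W) appearance feature
if __name__ == "__main__":
    m = AppearanceSPADEGroupNorm(channels=64, appear_channels=32, groups=8)
    print(m(torch.randn(2, 64, 16, 16), torch.randn(2, 32, 16, 16)).shape)
```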
3. Experiments

Quantitative Results:
➤ MicroCinema generates high-quality videos.
[Table: zero-shot text-to-video comparison on UCF-101 [38] (IS↑, FVD↓) and MSR-VTT [52] (FVD↓, CLIPSIM↑), split into methods trained on WebVid-10M plus additional data and methods trained on WebVid-10M only. Compared methods include Make-A-Video [36], VideoFactory [42], ModelScope [41], Lavie [45], VidRD [9], PYoCo [7], LVDM [10], CogVideo [17], MagicVideo [56], Video LDM [5], VideoComposer [43], VideoFusion [23], SimDA [51], Show-1 [53], and MicroCinema (Ours); MicroCinema reports an FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.]
Comparison on the zero-shot text-to-video generation performance on UCF-101 [38] and MSR-VTT [52].

Human Evaluation:
[Bar chart: human evaluation of Video LDM and ours on text alignment, motion quality, visual quality, and overall preference.]

Ablations:
[Table: ablation of appearance injection methods (Concat, Add-to-Dec, Add-to-EncDec, Add-to-EncDec-SPADE), reporting IS↑ and FVD↓; Add-to-EncDec-SPADE obtains the lowest FVD (508.56).]
Ablation study for appearance injection methods.
Appearance Noise Prior:
➤ Leveraging the denoising property of the diffusion model, we introduce the Appearance Noise Prior (ANP) by adding an appropriate amount of the center frame into the noise.
➤ Let ε = [ε_1, ε_2, ..., ε_N] denote the noise corresponding to a video clip with N frames. The training noise for our model is defined as:
    ε'_i = λ·z_0^c + ε_i
  where z_0^c is the latent of the reference center frame and λ controls its strength.
➤ For training, we adhere to the stable diffusion training setting and use noise prediction with the following loss function:
    L_θ = E[ || f_θ(z_t, t, z_0^c, c) − ε' ||² ]
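A minimal PyTorch sketch of these two definitions is given below; the function names, the denoiser signature, and the λ value are illustrative assumptions, not the released interface.

```python
import torch
import torch.nn.functional as F

def anp_training_noise(center_latent: torch.Tensor, num_frames: int,
                       lam: float = 0.02) -> torch.Tensor:
    """Build the Appearance Noise Prior training noise ε'_i = λ·z_0^c + ε_i.

    center_latent: (B, C, H, W) latent of the reference center frame.
    lam: illustrative coefficient, not a value taken from the poster."""
    b, c, h, w = center_latent.shape
    eps = torch.randn(b, num_frames, c, h, w, device=center_latent.device)
    return eps + lam * center_latent.unsqueeze(1)  # broadcast z_0^c over frames

def anp_loss(denoiser, z_t, t, center_latent, text_cond, eps_prime):
    """Noise-prediction objective with ε' as the regression target:
    L_θ = E[ || f_θ(z_t, t, z_0^c, c) − ε' ||² ]. `denoiser` stands in for f_θ."""
    pred = denoiser(z_t, t, center_latent, text_cond)
    return F.mse_loss(pred, eps_prime)
```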
➤ During the inference stage, our method allows for the direct application of existing ODE sampling algorithms. The sampling noise during the inference stage is λ·z_0^c + ε, where ε is sampled from N(0, I).
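A sketch of the corresponding inference-time setup follows, assuming a generic off-the-shelf sampler callable; the names and signatures are placeholders rather than the authors' API.

```python
import torch

def sample_with_anp(denoiser, sampler, center_latent, text_cond,
                    num_frames: int = 16, lam: float = 0.02):
    """Start sampling from λ·z_0^c + ε with ε ~ N(0, I), then hand that tensor to
    any existing ODE sampler unchanged. `denoiser` and `sampler` are stand-ins."""
    b, c, h, w = center_latent.shape
    eps = torch.randn(b, num_frames, c, h, w, device=center_latent.device)
    x_T = lam * center_latent.unsqueeze(1) + eps   # appearance-biased starting noise
    return sampler(denoiser, x_T, center_latent, text_cond)
```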
➤ Adding more repeat frames during inference results in videos with less motion but increased stability, which enhances both visual quality and FVD.

[Plot: FVD over coefficient values from 0.00 to 0.05.]
Qualitative Results:
[Figure: qualitative ablation of the Appearance Injection Network (AIN) and the Appearance Noise Prior (ANP), comparing Concat w/o ANP, Concat w/ ANP, AIN w/o ANP, and AIN w/ ANP (Ours).]
Qualitative ablation studies of Appearance Injection Network (AIN) and Appearance Noise Prior (ANP).
Comparison with other methods. The videos generated by our model show clear and coherent motion.