A detailed scientific research poster titled "MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation" by researchers Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo, and Baining Guo from Microsoft Research. The poster is presented at CVPR (Conference on Computer Vision and Pattern Recognition) in Seattle, WA, June 17-21, 2024.

**Main Section Breakdown:**

1. **Overview:**
   - **Motivation:** Introduces the aim of improving text-to-video generation by dividing it into image generation followed by image&text-to-video generation.
   - **Image&Text-to-Video Architecture:** A detailed diagram of the proposed diffusion-based image&text-to-video architecture.

2. **Method:**
   - Describes the Appearance Injection Network (AppearNet) for injecting center-frame appearance and the Appearance Noise Prior used during denoising.
   - Diagram illustrating the integration of AppearNet and the Appearance Noise Prior.

3. **Experiments:**
   - **Quantitative Results:** Tables of performance metrics demonstrating MicroCinema's ability to generate high-quality videos.
   - **Human Evaluation:** Graphs of human evaluation metrics for generated videos.
   - **Ablations:** Discusses the impact of various ablation studies with visual and statistical results.
   - **Qualitative Results:** Video sequences and model outputs comparing the default approach, the introduction of ANP, and other methods.

The poster is located at booth number 350 and features comprehensive content, including visual results and graphs, to highlight the efficacy of the proposed approach to text-to-video generation.

Text transcribed from the image:

Highlight | Microsoft Research | 微软亚洲研究院 (Microsoft Research Asia) | Booth 350 | CVPR, Seattle, WA, June 17-21, 2024

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Yanhui Wang, Jianmin Bao*, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo†, Baining Guo

1. Overview

Motivation:
➤ MicroCinema aims to improve text-to-video generation by dividing the process into two stages: leveraging advanced text-to-image models (e.g., SD, DALL-E, Midjourney) for generating high-quality images, and then focusing on motion dynamics for video generation.
[Diagram: MicroCinema pipeline. Text prompt "A corgi is running on the grass." → T2I → Image&Text-to-Video (TI2V) → Temporal Interpolation.]

Image&Text-to-Video Architecture:
[Diagram: overall architecture showing the VAE and 3D encoder/decoder, AppearNet, text-prompt embedding, downsampling and middle blocks, and the reconstruction path.]
Overall architecture of our proposed diffusion-based image&text-to-video model in MicroCinema. The proposed AppearNet provides appearance information for video generation. We also introduce an effective and distinctive Appearance Noise Prior tailored for fine-tuning text-to-image diffusion models.

2. Method

➤ To enhance the model's capability in handling the reference center frame, we introduce the Appearance Injection Network (AppearNet).
[Diagram: AppearNet design. The i-th 3D ResBlock and 3D attention block of the main branch are paired with the j-th 3D ResBlock and 3D attention block of AppearNet; the blocks contain spatial layers, temporal layers, 3D GroupNorm, SiLU, and 3x1x1 convolutions, and the appearance embedding (Appear Emb) is injected into the main branch.]
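Below is a minimal, hypothetical PyTorch sketch of SPADE-style group-norm appearance injection, the idea behind the "Add-to-EncDec-SPADE" variant compared in the Method and Ablations sections that follow: the group-normalized main-branch feature is modulated as (1 + γ)·GN(h) + β, with γ and β predicted from AppearNet's appearance features. The module name, tensor shapes, and the 1x1x1 convolutions used to predict γ and β are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of SPADE-style appearance
# injection: the main-branch feature h is group-normalized without affine
# parameters, then modulated per channel by scale gamma and shift beta
# predicted from the AppearNet appearance feature:
#   out = (1 + gamma) * GN(h) + beta
import torch
import torch.nn as nn


class SPADEGroupNorm(nn.Module):
    def __init__(self, channels: int, appear_channels: int, groups: int = 32):
        super().__init__()
        # Parameter-free normalization of the main-branch feature.
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        # Hypothetical 1x1x1 convolutions predicting the modulation.
        self.to_gamma = nn.Conv3d(appear_channels, channels, kernel_size=1)
        self.to_beta = nn.Conv3d(appear_channels, channels, kernel_size=1)

    def forward(self, h: torch.Tensor, appear: torch.Tensor) -> torch.Tensor:
        # h:      (B, C, T, H, W) main-branch feature
        # appear: (B, C_a, T, H, W) appearance feature from AppearNet
        gamma = self.to_gamma(appear)
        beta = self.to_beta(appear)
        return (1 + gamma) * self.norm(h) + beta


if __name__ == "__main__":
    # Dummy usage with assumed shapes.
    block = SPADEGroupNorm(channels=64, appear_channels=32)
    h = torch.randn(1, 64, 8, 16, 16)
    appear = torch.randn(1, 32, 8, 16, 16)
    print(block(h, appear).shape)  # torch.Size([1, 64, 8, 16, 16])
```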
➤ Appearance injection variants compared: ControlNet, Add-to-EncDec, and Add-to-EncDec with SPADE Group Norm, which modulates the group-normalized feature h as (1 + γ)·h + β, with γ and β predicted from the appearance embedding.

Appearance Noise Prior:
➤ Leveraging the denoising property of the diffusion model, we introduce the Appearance Noise Prior by adding an appropriate amount of the center frame into the noise. Let [ε₁, ε₂, ..., ε_N] denote the noise corresponding to a video clip with N frames. The training noise for our model is defined as

ε'ᵢ = λ·z⁰ + εᵢ,

where z⁰ is the latent of the reference center frame.
➤ For training, we adhere to the Stable Diffusion training setting and use noise prediction with the following loss function:

L_θ = E_{z⁰, c, ε∼N(0,I), t} [ ‖ f_θ(z_t, t, z⁰, c) − ε' ‖² ]

➤ During the inference stage, our method allows the direct application of existing ODE sampling algorithms. The sampling noise during inference is λ·z⁰ + εₙ, where εₙ is sampled from N(0, I).
➤ Adding more repeat frames (γ) during inference results in videos with less motion but increased stability, which improves both visual quality and FVD.
[Plot: FVD for different coefficient values (0.00, 0.01, 0.02, 0.04, 0.05).]

3. Experiments

Quantitative Results:
➤ MicroCinema generates high-quality videos.
[Table: zero-shot text-to-video generation performance on UCF-101 [38] (IS↑, FVD↓) and MSR-VTT [52] (FVD↓, CLIPSIM↑). Methods trained on WebVid-10M and additional data: Make-A-Video [36], VideoFactory [42], ModelScope [41], Lavie [45], VidRD [9], PYoCo [7]. Methods trained on WebVid-10M only: LVDM [10], CogVideo [17], MagicVideo [56], VideoLDM [5], VideoComposer [43], VideoFusion [23], SimDA [51], Show-1 [53], and MicroCinema (Ours), which reports an FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.]
Comparison on the zero-shot text-to-video generation performance on UCF-101 [38] and MSR-VTT [52].

Human Evaluation:
[Bar charts: human evaluation of VideoLDM and Ours on visual quality, motion quality, text alignment, and overall preference.]

Ablations:

| Method | Zero-Shot | IS ↑ | FVD ↓ |
|---|---|---|---|
| Concat | Yes | – | 688.92 |
| Add-to-Dec | Yes | 27.90 | 589.59 |
| Add-to-EncDec | Yes | 27.25 | 525.02 |
| Add-to-EncDec-SPADE | Yes | 29.63 | 508.56 |

Ablation study for appearance injection methods.

[Figure: qualitative ablation studies of the Appearance Injection Network (AIN) and Appearance Noise Prior (ANP), comparing Concat w/o ANP, Concat w/ ANP, AIN w/o ANP, and AIN w/ ANP (Ours).]

Qualitative Results:
[Figure: comparison with other methods.]
➤ The videos generated by our model show clear and coherent motion.
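To make the Appearance Noise Prior above concrete, here is a minimal PyTorch-style sketch of the noise construction, assuming Stable-Diffusion-style center-frame latents of shape (B, C, H, W); the function names, shapes, and the example λ = 0.02 are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the Appearance Noise Prior as described on the poster:
# the center-frame latent z0 is scaled by lambda and added to the per-frame
# Gaussian noise, both when building the training noise epsilon' and when
# drawing the initial noise for inference.
import torch


def training_noise(z0: torch.Tensor, num_frames: int, lam: float = 0.02) -> torch.Tensor:
    """epsilon'_i = lam * z0 + epsilon_i for each of the N frames.

    z0: (B, C, H, W) latent of the reference center frame.
    Returns noise of shape (B, C, N, H, W).
    """
    eps = torch.randn(z0.shape[0], z0.shape[1], num_frames, z0.shape[2], z0.shape[3])
    return lam * z0.unsqueeze(2) + eps


def inference_noise(z0: torch.Tensor, num_frames: int, lam: float = 0.02) -> torch.Tensor:
    """Initial sampling noise lam * z0 + epsilon_n, with epsilon_n ~ N(0, I)."""
    eps_n = torch.randn(z0.shape[0], z0.shape[1], num_frames, z0.shape[2], z0.shape[3])
    return lam * z0.unsqueeze(2) + eps_n


if __name__ == "__main__":
    z0 = torch.randn(1, 4, 32, 32)          # e.g. a latent of the generated center frame
    eps_prime = training_noise(z0, num_frames=16)
    print(eps_prime.shape)                   # torch.Size([1, 4, 16, 32, 32])
```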