The image shows a scientific poster titled "TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models", authored by Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, and Tao Mei of the University of Science and Technology of China and HiDream.ai Inc. The poster was presented at CVPR (Conference on Computer Vision and Pattern Recognition) in Seattle, WA, June 17-21, 2024, at booth 375. The poster is organized into four main sections:

1. Typical Image-to-Video Diffusion: contrasts independent noise prediction for each frame with residual-like noise prediction using an image noise prior, and highlights the typical approach's lack of temporal coherence modeling.
2. Temporal Residual Learning with Image Noise Prior: describes how TRIP improves temporal coherence by taking the image noise prior as reference noise.
3. Temporal Noise Fusion Module (TNF): details residual noise learning to capture motion dynamics and an attention mechanism that merges reference and residual noise.
4. Experiments and Analysis: comparisons with state-of-the-art (SOTA) methods, performance metrics for multiple models, an analysis of the first-frame condition, and an evaluation of the TRIP module via graphs and visual comparison tables.

The layout includes diagrams, formulas, results tables, and visual samples, providing a comprehensive overview of the research and findings. A QR code at the bottom left links to the project page. Overall, the poster communicates the work's technical advances in image-to-video diffusion, improving temporal coherence through novel techniques.
Text transcribed from the image:

[USTC seal, 1958] HiDream.ai

TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
Zhongwei Zhang¹, Fuchen Long², Yingwei Pan², Zhaofan Qiu², Ting Yao², Yang Cao¹, Tao Mei²
¹University of Science and Technology of China  ²HiDream.ai Inc.
CVPR, Seattle, WA, June 17-21, 2024 (Booth 375)
Project Page: https://trip-i2v.github.io/TRIP/

1. Typical Image-to-Video Diffusion vs. TRIP

Diagram: (a) independent noise prediction; (b) residual-like noise prediction with image noise prior. Components: Gaussian noise, first frame, 2D VAE, first-frame latent code z₀¹, input video, forward diffusion, text prompt feature, image noise prior estimation, noise estimation, residual estimation, 3D-UNet, TNF module, iterative denoising, generated video.

Typical image-to-video diffusion:
- Independent noise prediction for each frame.
- Ignores the inherent relation between the given image and each subsequent frame.
- Lacks temporal coherence modeling.

Typical noise formulation:
z_t^i = √(ᾱ_t) z₀^i + √(1 − ᾱ_t) ε^i,  ε^i ~ N(0, I),  z_t = {z_t^i}
ε̂ = ε_θ(z_t, t, c)

2. Temporal Residual Learning with Image Noise Prior (TRIP)
- Takes the image noise prior as reference noise to amplify alignment across frames.
- Residual noise learning is used to capture motion dynamics.
- An attention mechanism merges reference and residual noise to enhance temporal coherence.

TRIP noise formulation (the image noise prior is the reference noise computed against the first-frame latent z₀¹, complemented by a temporal residual noise Δε^i):
ε_t^i ≈ (z_t^i − √(ᾱ_t) z₀¹) / √(1 − ᾱ_t)

3. Temporal Noise Fusion Module (TNF)

Diagram: typical video denoising vs. temporal noise fusion for I2V denoising; time step t, Δε^i, Adaptive LayerNorm → Attention → Adaptive LayerNorm.
Training objective: L = E_{ε~N(0,I), t, c} [ ‖ε − ε_θ(z_t, t, c)‖² ]

4. Experiments and Analysis

Comparisons with SOTA methods.

Table 1. Performance in F-Consistency (F-Consistency₄: consistency among the first four frames; F-Consistency_all: consistency among all frames) and FVD on WebVid-10M.
Approach | F-Consistency₄ (↑) | F-Consistency_all (↑) | FVD (↓)
T2V-Zero [2] | 91.50 | 92.15 | 279
VideoComposer [60] | 88.78 | 92.52 | [illegible]
TRIP | 95.36 | 96.41 | 38.9

Table 2. Averaged FID and FVD over four scene categories on the DTDB dataset.
Approach | Zero-shot | FID (↓) | FVD (↓)
AL [1] | No | 65.1 | 934.2
CINN [9] | No | 31.9 | [illegible]
TRIP | Yes | 24.8 | 433.9

Table 3. FID and FVD on the MSR-VTT dataset.
Approach | Model Type | FID (↓) | FVD (↓)
CogVideo [28] | T2V | [illegible] | [illegible]
Make-A-Video [52] | T2V | [illegible] | [illegible]
ModelScopeT2V [3] | T2V | [illegible] | [illegible]
VideoComposer [60] | I2V | 31.25 | [illegible]
TRIP | I2V | [illegible] | [illegible]

Table 4. Human evaluation of the preference ratios between TRIP and other approaches on WebVid-10M.
Evaluation Item | vs. T2V-Zero | vs. VideoComposer
Temporal Coherence | 96.9 vs. 3.1 | [illegible]
Motion Fidelity | 93.8 vs. 6.2 | 81.3 vs. 18.7
Visual Quality | 90.6 vs. 9.4 | 87.5 vs. 12.5

Analysis of the first frame condition.

Table 5. Performance comparisons among different condition approaches on WebVid-10M.
Model | F-Consistency₄ (↑) | F-Consistency_all (↑) | FVD (↓)
TRIP variant [name illegible] | 94.77 | 96.13 | 41.3
TRIP^TE | 95.17 | 96.20 | 39.8
TRIP | 95.36 | 96.41 | 38.9

Analysis of the TRIP module.

Table 6. Evaluation of temporal residual learning in terms of F-Consistency and FVD on WebVid-10M.
Model | F-Consistency₄ (↑) | F-Consistency_all (↑) | FVD (↓)
TRIP variant [name illegible] | 94.66 | 95.92 | [illegible]
TRIP variant [name illegible] | 95.22 | 95.96 | [illegible]
TRIP | 95.36 | 96.41 | 38.9
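The forward-diffusion and image-noise-prior formulas on the poster can be sanity-checked numerically. The sketch below is a minimal NumPy illustration, not the paper's implementation: all names, toy sizes, and the beta schedule are assumptions. It shows that a reference noise computed against the first-frame latent recovers the true noise exactly for the first frame, while for later frames it deviates by a scaled content difference from frame 1, which is exactly what a learned temporal residual must account for.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes and a toy linear beta schedule (illustrative, not the paper's)
T, F, D = 1000, 8, 16          # diffusion steps, video frames, latent dim
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

z0 = rng.normal(size=(F, D))   # clean per-frame latents z_0^i; z0[0] is z_0^1
eps = rng.normal(size=(F, D))  # independent Gaussian noise eps^i per frame

t = 500
a = np.sqrt(alpha_bar[t])
b = np.sqrt(1.0 - alpha_bar[t])
zt = a * z0 + b * eps          # forward diffusion: z_t^i = sqrt(abar) z_0^i + sqrt(1-abar) eps^i

# image noise prior: reference noise computed against the FIRST frame latent z_0^1
eps_ref = (zt - a * z0[0]) / b

# for the first frame the prior recovers the true noise exactly
assert np.allclose(eps_ref[0], eps[0])

# for a later frame the prior deviates by the scaled content difference
# from frame 1 -- the part the temporal residual has to model
resid = eps_ref[1] - eps[1]
expected = (a / b) * (z0[1] - z0[0])
assert np.allclose(resid, expected)
```

Neighboring frames of a video share most of their content with the first frame, so this deviation is small and the reference noise is a strong starting point for denoising.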
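The TNF module is described as merging reference and residual noise with adaptive LayerNorm and attention. The following is only a toy NumPy sketch of that wiring under assumed names (`tnf_fuse`, `ada_layernorm`) and toy shapes; the actual module is a learned network conditioned on the diffusion time step, so this illustrates the data flow, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
F, D = 8, 16                          # frames, channel dim (toy sizes)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ada_layernorm(x, scale, shift, eps=1e-5):
    # adaptive LayerNorm: normalize per frame, then modulate with
    # (time-step-conditioned) scale and shift parameters
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps) * (1.0 + scale) + shift

def tnf_fuse(eps_ref, d_eps, scale, shift, Wq, Wk, Wv):
    # residual branch queries the reference noise via frame-to-frame attention,
    # and the result is added back to the residual (residual connection)
    q = ada_layernorm(d_eps, scale, shift) @ Wq
    kv = ada_layernorm(eps_ref, scale, shift)
    k, v = kv @ Wk, kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return d_eps + attn @ v

# toy inputs: reference noise from the image noise prior, predicted residual,
# and conditioning parameters that a real module would derive from time step t
eps_ref = rng.normal(size=(F, D))
d_eps = rng.normal(size=(F, D))
scale, shift = rng.normal(size=D) * 0.1, rng.normal(size=D) * 0.1
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

fused = tnf_fuse(eps_ref, d_eps, scale, shift, Wq, Wk, Wv)
print(fused.shape)   # (8, 16): one fused noise estimate per frame
```

The design choice worth noting is the fusion itself: instead of predicting each frame's noise independently, the final estimate is always a function of both the shared reference noise and a per-frame residual, which is how the method couples frames together.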