This image features a scientific poster presented at a conference. The poster showcases a research project titled "AVID: Any-Length Video Inpainting with Diffusion Model," a collaboration between Rutgers University and Meta. The poster outlines the following sections:

1. **Introduction**:
   - Challenges of video inpainting, emphasizing temporal consistency and support for various editing types such as object swap, retexturing, and uncropping.
   - Key features of the approach, which handles videos of arbitrary duration with high fidelity.
2. **Method Overview**:
   - Temporal consistency via motion modules.
   - Adjustable structure guidance for varying fidelity requirements.
   - Zero-shot handling of arbitrary durations for any-length video inference.
   - Temporal MultiDiffusion and Middle-frame Attention Guidance.
3. **Approach**:
   - A two-step methodology comprising training and inference phases.
   - Training of motion modules and a structure guidance module.
   - Segment-wise denoising and an attention-guidance mechanism for enhanced results.
4. **Experiments**:
   - Comparative analysis against other inpainting methods.
   - Evaluation metrics include background preservation (BP), text-video alignment (TA), and temporal consistency (TC).
   - Experimental results illustrated with tables and figures, demonstrating the effectiveness of the proposed model.

The poster is visually rich with images, tables, figures, and diagrams, providing a comprehensive overview of the research, methodology, and experimental outcomes. QR codes and the institutional logos of Rutgers University and Meta are displayed at the top.

Text transcribed from the image:

RUTGERS, The State University of New Jersey | Meta

**AVID: Any-Length Video Inpainting with Diffusion Model**
Zhixing Zhang, Bichen Wu², Xiaoyan Wang², Yaqiao Luo², Luxin Zhang², Yinan Zhao², Peter Vajda², Dimitris N. Metaxas¹, Licheng Yu²
¹Rutgers University ²GenAI, Meta

Introduction:
➤ Challenges for video inpainting:
  ➤ Temporal consistency
  ➤ Various editing types → different levels of structural fidelity
    ➤ Object swap (e.g. sedan → sports car)
    ➤ Retexturing (e.g. white coat → red one)
    ➤ Uncropping (e.g. 256×512 → 512×512)
  ➤ Arbitrary duration
Example prompts: "A MINI Cooper driving down a road." (5.3 s) · "A yellow maple leaf" (2.7 s) · "A train traveling over a bridge in the mountains." (8.0 s)

Approach:
In the training phase of our methodology, we employ a two-step approach.
➤ Motion modules are integrated after each layer of the primary Text-to-Image (T2I) inpainting model and optimized for the video inpainting task via synthetic masks applied to the video data.
➤ During the second training step, we fix the parameters of the UNet and train a structure guidance module, initialized as a parameter copy of the UNet encoder.
During inference,
➤ for a video of length N′, we construct a series of segments, each comprising N successive frames. At each denoising step, results for every segment are computed and aggregated (see the sketch below).
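The segment-wise aggregation above is the Temporal MultiDiffusion step. The following Python/PyTorch snippet is a minimal sketch under assumptions, not the poster's exact implementation: `denoise_segment` is a hypothetical stand-in for one denoising call of the video inpainting model, and the window size, stride, and uniform averaging weights are illustrative choices.

```python
import torch


def temporal_multidiffusion_step(latents, denoise_segment, window=16, stride=8):
    """One denoising step of Temporal MultiDiffusion over an N'-frame video.

    latents:          (N', C, H, W) noisy latent frames of the whole video.
    denoise_segment:  hypothetical callable running one denoising step of the
                      video inpainting model on an (n, C, H, W) segment.
    The video is covered by overlapping windows of `window` frames; per-frame
    predictions are averaged wherever windows overlap.
    """
    n_frames = latents.shape[0]
    out = torch.zeros_like(latents)
    counts = torch.zeros(n_frames, 1, 1, 1,
                         device=latents.device, dtype=latents.dtype)

    # Start indices of overlapping segments that cover all N' frames.
    starts = list(range(0, max(n_frames - window, 0) + 1, stride))
    if starts[-1] + window < n_frames:          # make sure the tail frames are covered
        starts.append(n_frames - window)

    for s in starts:
        seg = latents[s:s + window]
        out[s:s + window] += denoise_segment(seg)   # denoised latents for this segment
        counts[s:s + window] += 1.0

    return out / counts     # average the overlapping per-frame predictions
```

Called once per diffusion timestep, a scheme like this keeps memory bounded by the segment length N while still producing a prediction for all N′ frames; frames shared by neighboring windows receive the average of their predictions, which ties the segments together.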
[Figure: (a) Motion module training, (b) Structure guidance training, (c) Inference; motion modules are inserted alongside the base T2I weights.]

Experiments:

| Method | Re-texturing (BP ↓ / TA ↑ / TC ↑) | Uncropping (BP ↓ / TA ↑ / TC ↑) | Object swap (BP ↓ / TA ↑ / TC ↑) |
|---|---|---|---|
| PF | 43.1 / 31.3 / 93.6 | 41.4 / 31.1 / 92.5 | 41.4 / 31.2 / 92.4 |
| T2V0 | 49.0 / 31.4 / 96.5 | 47.3 / 30.1 / 94.9 | 47.9 / 30.6 / 95.0 |
| VC | 55.7 / 31.2 / 96.4 | 71.0 / 31.5 / 96.5 | 64.5 / 32.1 / 95.5 |
| Ours | 42.3 / 31.3 / 97.2 | 41.1 / 31.5 / 96.5 | 40.7 / 32.0 / 96.3 |

➤ We compare our method against several approaches, including per-frame in-painting (PF) using Stable Diffusion In-painting, Text2Video-Zero (T2V0), and VideoComposer (VC), on different video inpainting sub-tasks, and evaluate the generated results with different metrics, including background preservation (BP, ↓ better), text-video alignment (TA, ↑ better), and temporal consistency (TC, ↑ better). * indicates structure guidance for VC and our approach.
➤ In our user preference studies, we juxtaposed our method with per-frame in-painting techniques, evaluating prominent baselines such as diffusion-based image in-painting, Text2Video-Zero, and VideoComposer (VC) and assessing their performance across various tasks.

[User preference study chart: Ours vs. per-frame in-painting, shown for object swap and uncropping.]

Qualitative examples: Re-texturing: "A purple car driving down a road." Object swap: "A flamingo swimming …"

Method Overview:
➤ Temporal consistency → motion modules
➤ Various fidelity requirements → adjustable structure guidance
➤ Arbitrary duration → zero-shot any-length video inference
  ➤ Temporal MultiDiffusion
  ➤ Middle-frame Attention Guidance
➤ At inference, during each denoising step and within every self-attention layer, we retain the key and value features K_⌈N′/2⌉ and V_⌈N′/2⌉ from the frame in the middle of the video. For the video's i-th frame, we use its pixel queries, denoted Q_i, to compute an auxiliary attention feature map, which is then fused with the existing self-attention feature map within the same layer (a sketch follows below).

[Diagram: middle-frame keys and values (K_⌈N′/2⌉, V_⌈N′/2⌉) injected into the self-attention layers.]
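To make the fusion step concrete, here is a minimal sketch of middle-frame attention guidance, assuming standard scaled dot-product self-attention. The fusion weight `alpha`, the 0-indexed choice of middle frame, and the function name `middle_frame_attention` are illustrative assumptions rather than the poster's exact formulation.

```python
import torch
import torch.nn.functional as F


def middle_frame_attention(q, k, v, alpha=0.5):
    """Self-attention with middle-frame attention guidance (sketch).

    q, k, v: (N', L, D) per-frame query/key/value features from one
             self-attention layer (N' frames, L tokens, D channels).
    Each frame attends to its own tokens (standard self-attention) and,
    additionally, its queries attend to the middle frame's keys/values;
    the two feature maps are fused with an assumed weight `alpha`.
    """
    n_frames = q.shape[0]
    mid = n_frames // 2                      # middle-frame index (≈ ⌈N'/2⌉)

    # Standard per-frame self-attention.
    self_attn = F.scaled_dot_product_attention(q, k, v)

    # Auxiliary attention: every frame's queries Q_i attend to the middle
    # frame's keys and values, broadcast to all N' frames.
    k_mid = k[mid].expand(n_frames, -1, -1)
    v_mid = v[mid].expand(n_frames, -1, -1)
    aux_attn = F.scaled_dot_product_attention(q, k_mid, v_mid)

    # Fuse the auxiliary map with the existing self-attention map.
    return (1.0 - alpha) * self_attn + alpha * aux_attn
```

Here `alpha` acts as a guidance strength: larger values anchor every frame's appearance more strongly to the middle frame, at the cost of per-frame flexibility.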