Detailed Caption: This image showcases a scientific poster presentation from Rutgers University and Meta, presented at a conference in Seattle. The poster is titled "AVID: Any-Length Video Inpainting with Diffusion Model" and includes work by Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris N. Metaxas, and Licheng Yu. The poster outlines a methodology for video inpainting that addresses challenges such as temporal consistency, editing types requiring different levels of structural fidelity, and inpainting of arbitrary duration. It covers different editing tasks such as uncropping and object swap (e.g., sedan to sports car), detailing the methods used to overcome these challenges. Key sections of the poster include:

1. **Introduction**: Discusses the challenges of video inpainting and the different editing types.
2. **Method Overview**: Summarizes the approach, focusing on temporal consistency, adjustable structure guidance, and multi-diffusion techniques.
3. **Approach**: Describes the two-step training process: integrating motion modules after each layer of the primary Text-to-Image (T2I) inpainting model, then training an adjustable structure guidance module.
4. **Experiments**: Presents a comparison of their method against other models on various inpainting sub-tasks, using metrics such as background preservation, text-video alignment, and temporal consistency.
5. **Results and Visuals**: Includes images and graphs demonstrating the effectiveness and applications of their model across different scenarios, emphasizing the improvements made by their approach.

This poster represents cutting-edge advancements in video inpainting technology, reflecting collaborative efforts in computer vision and AI research to enhance video editing capabilities.

Text transcribed from the image:

RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY | Meta

AVID: Any-Length Video Inpainting with Diffusion Model
Zhixing Zhang¹, Bichen Wu², Xiaoyan Wang², Yaqiao Luo², Luxin Zhang², Yinan Zhao², Peter Vajda², Dimitris N. Metaxas¹, Licheng Yu²
¹Rutgers University  ²GenAI, Meta

Introduction:
➤ Challenges for video inpainting:
  ➤ Temporal consistency
  ➤ Various editing types -> different levels of structural fidelity
    ➤ Object swap (e.g. sedan -> sports car)
    ➤ Retexturing (e.g. white coat -> red one)
    ➤ Uncropping (e.g. 256x512 -> 512x512)
  ➤ Arbitrary duration
➤ Example prompts: "A MINI Cooper driving down a road." (5.3 s); "A yellow maple leaf" (2.7 s); "A train traveling over a bridge in the mountains." (8.0 s)

Approach:
In the training phase of our methodology, we employ a two-step approach.
➤ Motion modules are integrated after each layer of the primary Text-to-Image (T2I) inpainting model and optimized for the video inpainting task via synthetic masks applied to the video data.
➤ During the second training step, we fix the parameters of the UNet, ε_θ, and train a structure guidance module, leveraging a parameter copy from the UNet encoder.
During inference,
➤ for a video of length N', we construct a series of segments, each comprising N successive frames. Throughout each denoising step, results for every segment are computed and aggregated.
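The segment-wise aggregation described in the last bullet above corresponds to a MultiDiffusion-style blend over time. Below is a minimal PyTorch sketch of that idea; the function name, the `denoise_window` callable, and the stride handling are illustrative assumptions, not details taken from the poster.

```python
import torch

def temporal_multidiffusion_step(latents, denoise_window, window_size, stride):
    """One denoising step with temporal MultiDiffusion-style aggregation (sketch).

    latents:        (N', C, H, W) noisy latents for all N' frames at this step
    denoise_window: callable that denoises a window of `window_size` frames and
                    returns a tensor of the same shape (hypothetical stand-in
                    for the base inpainting model plus motion modules)
    """
    num_frames = latents.shape[0]
    assert num_frames >= window_size, "sketch assumes at least one full window"

    accum = torch.zeros_like(latents)                      # summed window predictions
    counts = torch.zeros(num_frames, 1, 1, 1,
                         device=latents.device, dtype=latents.dtype)

    # Slide overlapping windows of `window_size` consecutive frames over the video.
    starts = list(range(0, num_frames - window_size + 1, stride))
    if starts[-1] != num_frames - window_size:
        starts.append(num_frames - window_size)            # make sure the tail frames are covered

    for s in starts:
        accum[s:s + window_size] += denoise_window(latents[s:s + window_size])
        counts[s:s + window_size] += 1

    # Average the predictions of all windows that cover each frame.
    return accum / counts
```

A full sampler would call a step like this once per diffusion timestep, so each frame is denoised consistently with its neighbors while the total length N' remains unconstrained.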
[Pipeline figure with panels: (a) Motion module training, (b) Structure guidance training, (c) Inference; motion modules are stacked on base T2I inpainting weights, with noise sampled from N(0,1).]

Experiments:

                Re-texturing          Uncropping            Object swap
Method          BP↓   TA↑   TC↑       BP↓   TA↑   TC↑       BP↓   TA↑   TC↑
PF              43.1  31.3  93.6      41.4  31.1  92.5      41.4  31.2  92.4
T2V0            49.0  31.4  96.5      47.3  30.1  94.9      47.9  30.6  95.0
VC              55.7  31.2  96.4      71.0  31.5  96.5      64.5  32.1  95.5
Ours            42.3  31.3  97.2      41.1  31.5  96.5      40.7  32.0  96.3

[Conference logo: SEATTLE, WA. User preference bar chart comparing Ours against per-frame inpainting on the object swap and uncropping tasks.]

➤ We compare our method against several approaches, including per-frame in-painting (PF) using Stable Diffusion In-painting, Text2Video-Zero (T2V0), and VideoComposer (VC), on different video inpainting sub-tasks, and evaluate the generated results with different metrics, including background preservation (BP, ↓ better), text-video alignment (TA, ↑ better), and temporal consistency (TC, ↑ better). * indicates structure guidance for VC and our approach.
➤ In our user preference studies, we juxtaposed our method with per-frame in-painting techniques by evaluating prominent approaches such as diffusion-based image in-painting, Text2Video-Zero, and VideoComposer (VC), assessing their performance across various tasks.

Re-texturing: "A purple car driving down a road."  Object swap: "A flamingo swimmi…"

Method Overview:
➤ Temporal consistency → motion modules
➤ Various fidelity requirements → adjustable structure guidance
➤ Arbitrary duration → zero-shot any-length video inference
  ➤ Temporal MultiDiffusion
  ➤ Middle-frame Attention Guidance
➤ At inference, during each denoising step and within every self-attention layer, we retain the K_[N'/2] and V_[N'/2] values from the frame in the middle of the video. For the video's i-th frame, we utilize its pixel queries, denoted as Q_i, to compute an auxiliary attention feature map. This is subsequently fused with the existing self-attention feature map within the same layer.

[Diagram: the middle frame's K_[N'/2] and V_[N'/2] are routed into each frame's self-attention layer.]
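To make the middle-frame attention guidance concrete, here is a minimal PyTorch sketch. It assumes the fusion is a simple weighted blend of the frame's own self-attention output and an auxiliary attention output computed against the middle frame's cached keys and values; the function name, the `blend` parameter, and the blending rule are assumptions for illustration, not details stated on the poster.

```python
import torch
import torch.nn.functional as F

def middle_frame_attention_guidance(q, k, v, k_mid, v_mid, blend=0.5):
    """Self-attention with middle-frame attention guidance (sketch).

    q, k, v:      query/key/value tensors of the i-th frame, shape (tokens, dim)
    k_mid, v_mid: key/value tensors cached from the middle frame [N'/2]
    blend:        fusion weight between the two attention outputs
                  (hypothetical; the poster does not specify the fusion rule)
    """
    scale = q.shape[-1] ** -0.5

    # Regular self-attention of frame i over its own keys and values.
    attn_self = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

    # Auxiliary attention: frame i's queries attend to the middle frame's
    # cached keys/values, anchoring every frame to a shared reference frame.
    attn_mid = F.softmax(q @ k_mid.transpose(-2, -1) * scale, dim=-1) @ v_mid

    # Fuse the two attention feature maps within the same layer.
    return (1.0 - blend) * attn_self + blend * attn_mid
```

In a sampler, the middle frame's keys and values would be computed once per denoising step and reused in this way for every other frame of the video.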