This image is a scientific poster titled "View-Change Human-Centric Video Editing" (the visible portion of the title), presented by authors including Jussi Keppo, Ying Shan, and Mike Zheng Shou, with Tencent PCG among the listed affiliations, at the CVPR conference in Seattle, WA, June 17-21, 2024. The poster details the DynVideo-E framework, a method for human-centric video editing that integrates a 3D dynamic human space with a 3D static background space. The framework comprises two primary workflows:

1. An orange flowchart for editing an animatable 3D dynamic human space with multi-view multi-pose SDS (score distillation sampling).
2. A green flowchart for editing the 3D background space with a style transfer loss.

The poster visually demonstrates how captured human videos are transferred into different background scenes through steps involving a Video-NeRF representation and a style reference image input. Performance is evaluated with CLIPScore and with human preference on textual faithfulness, temporal consistency, and overall quality, with DynVideo-E showing superior results in most categories. Experimental results compare against Text2Video-Zero, Rerender-A-Video, Text2LIVE, StableVideo, and CoDeF, showing that DynVideo-E consistently scores higher in textual faithfulness, temporal consistency, and overall quality. An ablation study shows the impact of removing individual components of the model. Overall, the poster showcases an approach to creating dynamic and visually coherent video edits through neural rendering techniques.

Text transcribed from the image:

View-Change Human-Centric Video Editing
Jussi Keppo, Ying Shan, Mike Zheng Shou - Tencent PCG
CVPR, Seattle, WA, June 17-21, 2024
Contact: jiawei.liu@u.nus.edu

DynVideo-E Framework
[Framework diagram: the background static space is rendered from a NeRF and matched to a reference style image via VGG-16 features, using a nearest-neighbour feature matching loss L_NNFM and a reconstruction loss L_Rec under reference and random cameras; the foreground canonical human space is edited with a 3D SDS loss (Zero123 prior) and a 2D SDS loss (personalized Stable Diffusion prior), driven by reference pose, frame pose, and motion inputs; frozen vs. trainable modules are indicated.]

- Represents a human-centric video as a 3D dynamic human space and a 3D static background space.
- Orange flowchart: edit the animatable 3D dynamic human space with multi-view multi-pose SDS.
- Green flowchart: edit the 3D background space with a style transfer loss.
- Edited videos and free-viewpoint contents are rendered from the edited video-NeRF model.

Experimental Results

Human preference (baseline v.s. Ours) and CLIPScore:

Method                 CLIPScore (↑)  Textual Faithfulness (↑)  Temporal Consistency (↑)  Overall Quality (↑)
Text2Video-Zero [20]   26.70          9.17 v.s. 90.83 (Ours)    21.25 v.s. 78.75 (Ours)   12.08 v.s. 87.92 (Ours)
Rerender-A-Video [57]  26.11          6.67 v.s. 93.33 (Ours)    25.00 v.s. 75.00 (Ours)   9.58 v.s. 90.42 (Ours)
Text2LIVE [2]          22.77          3.81 v.s. 96.19 (Ours)    26.67 v.s. 73.33 (Ours)   9.05 v.s. 90.95 (Ours)
StableVideo [6]        22.02          4.29 v.s. 95.71 (Ours)    24.29 v.s. 75.71 (Ours)   6.19 v.s. 93.81 (Ours)
CoDeF [33]             16.77          1.25 v.s. 98.75 (Ours)    3.75 v.s. 96.25 (Ours)    1.25 v.s. 98.75 (Ours)
DynVideo-E (Ours)      31.31

Ablation study (scores on the "Backpack" and "Lab" scenes):

Ablation components                         Backpack  Lab
Full model                                  0.756     0.647
w/o Super-resolution                        0.736     0.645
w/o Super-resolution, Rec                   0.728     0.617
w/o Super-resolution, Rec, 2D SDS           0.679     0.517
w/o Super-resolution, Rec, 3D SDS           0.698     0.613
w/o Super-resolution, Rec, 3D SDS, 2D LoRA  0.711     0.539
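The multi-view multi-pose SDS editing in the orange flowchart builds on score distillation sampling. As a hedged sketch, the standard SDS gradient (DreamFusion-style; the poster's exact multi-view multi-pose formulation may add pose and view conditioning on top of this) is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

Here x is a rendering of the NeRF with parameters θ, x_t its noised version at diffusion timestep t, ε̂_φ the frozen diffusion prior's noise prediction conditioned on y (on this poster, a Zero123 prior for the 3D SDS term and a personalized Stable Diffusion prior for the 2D SDS term), and w(t) a timestep weighting.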
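The green flowchart's style transfer loss (L_NNFM) matches VGG-16 features of the rendered background to features of the reference style image by nearest-neighbour feature matching. Below is a minimal numpy sketch of that matching step; the function name and the assumption that features arrive as flattened (N, C) arrays are illustrative, and a real pipeline would extract the features with a VGG-16 backbone and backpropagate through the renderer.

```python
import numpy as np

def nnfm_loss(rendered_feats, style_feats, eps=1e-8):
    """Nearest-neighbour feature matching (illustrative sketch).

    For every rendered feature vector, find its closest style feature
    under cosine distance and average those minimum distances.
    rendered_feats: (N_r, C) array, style_feats: (N_s, C) array.
    """
    r = rendered_feats / (np.linalg.norm(rendered_feats, axis=1, keepdims=True) + eps)
    s = style_feats / (np.linalg.norm(style_feats, axis=1, keepdims=True) + eps)
    cos_dist = 1.0 - r @ s.T          # (N_r, N_s) pairwise cosine distances
    return float(cos_dist.min(axis=1).mean())
```

The min over style features is what distinguishes this loss from a plain Gram-matrix style loss: each rendered patch is free to match whichever style patch it resembles most.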