A detailed scientific poster is displayed at a conference venue. The poster, titled "Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework," outlines research on generating 2D avatars from video footage using a novel diffusion-based system. The method, results, and system overview are presented with detailed text, diagrams, and sample images. The introduction highlights current challenges in generating high-quality half-body/full-body human videos, while the method section describes the approach, including frame-wise motion-to-appearance diffusing and batch-overlapped temporal denoising. The results show visual comparisons and quantitative evaluations demonstrating the effectiveness of the model. The poster also features institutional logos at the top and icons for contact and resource information, indicating its presence at an academic or professional event. Attendees are visible in the background, walking through the exhibition space.

Text transcribed from the image:

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
Ziyao Huang¹, Fan Tang¹, Yong Zhang², Xiaodong Cun², Juan Cao¹, Jintao Li¹, Tong-Yee Lee³
¹ Institute of Computing Technology, Chinese Academy of Sciences; ² Tencent AI Lab; ³ National Cheng Kung University
CVPR, Seattle, WA, June 17-21, 2024

Introduction
Challenges: Generating a half-body/full-body human video is still challenging: a) GAN-based solutions limit the visual quality of the generated videos; b) current diffusion-based approaches suffer from temporal inconsistency.
Make-Your-Anchor: a novel diffusion-based system for generating anchor-style videos: a) frame-wise motion-to-appearance diffusing binds movements to appearance via two-stage training; b) batch-overlapped temporal denoising generates temporally consistent human video of arbitrarily long duration. The result is high-quality human videos compared to SOTA GAN/diffusion-based approaches.

System overview
[Diagram: video for model training → "Make your anchor!" → motion condition → generated anchor videos (appearance & motion).]

Method
Setting: We formulate 2D avatar generation as a learning paradigm from a video of one identity, where a personalized diffusion model generates a human video in the same scenario as the input video.
Input: We utilize human 3D meshes rendered from SMPL-X parameters as input poses, which offer smoothness and structural preservation, especially for hand gestures.
[Pipeline diagram: Structure-Guided Diffusion Model (SGDM) with face and mesh conditions → Batch-overlapped Temporal Denoising → Identity-Specific Face Enhancement.]

➤ Frame-wise Motion-to-Appearance Diffusing: We train the model frame-wise to bind appearance and motion. A structure-guided diffusion model (SGDM) generates human images under the control of 3D mesh conditions frame by frame, with pre-training on multiple identities for motion generation and fine-tuning on a single identity to bind the movements to the appearance (a toy sketch of this two-stage training follows below).
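To make the two-stage "motion-to-appearance" idea concrete, here is a minimal, self-contained Python sketch of such a training loop. Everything in it is an assumption for illustration: ToySGDM is a stand-in for the actual SGDM, the single fixed noise level replaces a real diffusion schedule, and random tensors stand in for (frame, rendered-mesh) pairs; the poster does not specify these details.

```python
import torch
import torch.nn as nn

class ToySGDM(nn.Module):
    """Stand-in for the structure-guided diffusion model: predicts noise
    for one frame given its rendered-mesh condition (hypothetical)."""
    def __init__(self, channels=3):
        super().__init__()
        # Noisy frame and mesh render are concatenated on the channel axis.
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, noisy_frame, mesh_render):
        return self.net(torch.cat([noisy_frame, mesh_render], dim=1))

def train_stage(model, frames, mesh_renders, steps, lr):
    """One training stage: simple denoising loss on (frame, mesh) pairs."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(frames)
        noisy = frames + noise  # simplified: one fixed noise level
        loss = nn.functional.mse_loss(model(noisy, mesh_renders), noise)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

model = ToySGDM()
# Stage 1: pre-train on many identities to learn motion-conditioned structure.
train_stage(model, torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64),
            steps=5, lr=1e-4)
# Stage 2: fine-tune on one identity to bind the motion to its appearance.
train_stage(model, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
            steps=5, lr=1e-5)
```

One way to read the two stages: the broad multi-identity pass teaches the model how mesh conditions map to body structure, and the smaller single-identity pass (typically at a lower learning rate) ties that structure to one person's appearance.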
➤ Batch-overlapped Temporal Denoising: We run inference video-wise to obtain consistent videos of arbitrary length. The 2D U-Net is extended to a 3D video-diffusion U-Net via a training-free all-frame cross-attention module, and a simple, effective algorithm operates on multi-batch noise to generate anchor videos of arbitrary length (see the sketches at the end of this transcription).
➤ Identity-Specific Face Enhancement: We further revise and enhance the face region with an inpainting-based approach using crop-and-blend operations (also sketched at the end of this transcription).

Results
Quantitative results: [User-study table comparing TPS, DreamPose, DisCo, and Ours on appearance, temporal consistency, and structure preservation; the scores are not fully legible.] Our method achieves better performance on image quality, temporal consistency, and structure preservation. Rating scores are on a scale from one to five, where five is the highest score and one is the lowest.
Qualitative results compared with other methods: Our method achieves accurate gestures and high-quality generation with facial details.
Cross-person motion results (left: pose; right: output): When the cross-person motion is in a similar style, the generated results are of good quality across the whole generated body.
Full-body results. Audio-driven results.
GitHub
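The "training-free all-frame cross-attention" can be pictured as reusing a 2D self-attention layer while letting keys and values come from every frame in the batch, so frames exchange appearance information without any new trained weights. The sketch below is an assumption about the general shape of such a mechanism (the function name, tensor shapes, and projections are all hypothetical), not the poster's exact module.

```python
import torch
import torch.nn as nn

def all_frame_attention(q_proj, k_proj, v_proj, x):
    """x: (frames, tokens, dim). Queries stay per-frame; keys/values are
    gathered from all frames so every frame attends to every other one."""
    f, t, d = x.shape
    q = q_proj(x)                          # (f, t, d) per-frame queries
    kv = x.reshape(1, f * t, d)            # tokens pooled from all frames
    k, v = k_proj(kv), v_proj(kv)          # (1, f*t, d), broadcast below
    attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ v                        # (f, t, d)

# Toy usage with random projections and latents (8 frames, 64 tokens).
d = 32
out = all_frame_attention(nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d),
                          torch.randn(8, 64, d))
```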
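For batch-overlapped temporal denoising, a common way to realize "an algorithm operated on multi-batch noise" is to denoise overlapping windows of the latent video at every step and average the predictions where windows overlap, so adjacent batches agree at their seams. This is a sketch under that assumption: denoise_step is a hypothetical per-window denoiser, and the window/stride values are arbitrary.

```python
import torch

def batch_overlapped_denoise(denoise_step, init_noise, num_steps,
                             window=16, stride=8):
    """Denoise an arbitrary-length latent video (frames, C, H, W) with
    overlapping windows, averaging predictions in overlapped regions."""
    latents = init_noise
    n = latents.shape[0]
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)          # make sure the tail is covered
    for t in reversed(range(num_steps)):
        acc = torch.zeros_like(latents)
        cnt = torch.zeros(n, 1, 1, 1)
        for s in starts:
            acc[s:s + window] += denoise_step(latents[s:s + window], t)
            cnt[s:s + window] += 1
        latents = acc / cnt                # average where windows overlap
    return latents

# Toy usage: a "denoiser" that just shrinks latents toward zero.
video = batch_overlapped_denoise(lambda z, t: 0.9 * z,
                                 torch.randn(40, 4, 8, 8), num_steps=5)
```

Because every window shares frames with its neighbours, the averaged updates keep the batches consistent, which is what lets the video grow to arbitrary length without visible seams.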
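Finally, the inpainting-based crop-and-blend face enhancement can be sketched as: crop the face box, regenerate it with an identity-specific model, and blend the result back under a soft mask so the seam is invisible. In the sketch, inpaint_face is a hypothetical model call and the Hann-window mask is just one plausible blending choice, not the poster's stated method.

```python
import torch

def enhance_face(frame, face_box, inpaint_face):
    """Crop the face region of frame (C, H, W), enhance it, and blend it
    back with a soft mask (1 at the crop centre, 0 at its edges)."""
    x0, y0, x1, y1 = face_box
    crop = frame[:, y0:y1, x0:x1]
    enhanced = inpaint_face(crop)          # hypothetical inpainting model
    h, w = y1 - y0, x1 - x0
    mask = (torch.hann_window(h).view(h, 1) *
            torch.hann_window(w).view(1, w)).unsqueeze(0)
    out = frame.clone()
    out[:, y0:y1, x0:x1] = mask * enhanced + (1 - mask) * crop
    return out

# Toy usage with a stand-in "enhancer" that only clamps values.
frame = torch.rand(3, 128, 128)
result = enhance_face(frame, (40, 20, 90, 80), lambda c: c.clamp(0, 1))
```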