A detailed scientific poster is displayed at a conference venue. The poster, titled "Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework," outlines research on generating 2D avatars from video footage using a novel diffusion-based system. The method, results, and system overview are presented with detailed text, diagrams, and sample images. The introduction highlights current challenges in generating high-quality half-body/full-body human videos, while the method section describes the approach, including frame-wise motion-to-appearance diffusing and batch-overlapped temporal denoising. The results show visual comparisons and quantitative evaluations demonstrating the effectiveness of the model. The poster also features institutional logos at the top and icons for contact and resource information, indicating its presence at an academic or professional event. Attendees are visible in the background, walking through the exhibition space.

Text transcribed from the image:

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
Ziyao Huang¹, Fan Tang¹, Yong Zhang², Xiaodong Cun², Juan Cao¹, Jintao Li¹, Tong-Yee Lee³
¹ Institute of Computing Technology, Chinese Academy of Sciences; ² Tencent AI Lab; ³ National Cheng Kung University
CVPR, Seattle, WA, June 17-21, 2024

Introduction
Challenges: Generating a half-body/full-body human video is still challenging: a) GAN-based solutions limit the visual quality of the generated videos; b) current diffusion-based approaches suffer from temporal inconsistency.
Make-Your-Anchor: a novel diffusion-based system for generating anchor-style videos: a) frame-wise motion-to-appearance diffusing binds movements to appearance via two-stage training; b) batch-overlapped temporal denoising generates temporally consistent human video of arbitrarily long duration. The result is high-quality human videos compared to SOTA GAN/diffusion-based approaches.

System overview
[Diagram: video for model training → "Make your anchor!" → motion condition → generated anchor videos (appearance & motion).]

Method
Setting: We formulate 2D avatar generation as a learning paradigm from a video of one identity, where a personalized diffusion model generates a human video in the same scenario as the input video.
Input: We utilize human 3D meshes rendered from SMPL-X parameters as input poses, which offer smoothness and structural preservation, especially for hand gestures.
[Pipeline diagram: Structure-Guided Diffusion Model (SGDM) with face and mesh conditions → Batch-overlapped Temporal Denoising → Identity-Specific Face Enhancement.]

➤ Frame-wise Motion-to-Appearance Diffusing: We train the model frame-wise to bind appearance and motion. A structure-guided diffusion model (SGDM) generates human images under the control of 3D mesh conditions frame by frame, with pre-training on multiple identities for motion generation and fine-tuning on a single identity to bind the movements to the appearance (a toy sketch of this two-stage training follows below).
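To make the two-stage "motion-to-appearance" idea concrete, here is a minimal, self-contained Python sketch of such a training loop. Everything in it is an assumption for illustration: ToySGDM is a stand-in for the actual SGDM, the single fixed noise level replaces a real diffusion schedule, and random tensors stand in for (frame, rendered-mesh) pairs; the poster does not specify these details.

```python
import torch
import torch.nn as nn

class ToySGDM(nn.Module):
    """Stand-in for the structure-guided diffusion model: predicts noise
    for one frame given its rendered-mesh condition (hypothetical)."""
    def __init__(self, channels=3):
        super().__init__()
        # Noisy frame and mesh render are concatenated on the channel axis.
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, noisy_frame, mesh_render):
        return self.net(torch.cat([noisy_frame, mesh_render], dim=1))

def train_stage(model, frames, mesh_renders, steps, lr):
    """One training stage: simple denoising loss on (frame, mesh) pairs."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(frames)
        noisy = frames + noise  # simplified: one fixed noise level
        loss = nn.functional.mse_loss(model(noisy, mesh_renders), noise)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

model = ToySGDM()
# Stage 1: pre-train on many identities to learn motion-conditioned structure.
train_stage(model, torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64),
            steps=5, lr=1e-4)
# Stage 2: fine-tune on one identity to bind the motion to its appearance.
train_stage(model, torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
            steps=5, lr=1e-5)
```

One way to read the two stages: the broad multi-identity pass teaches the model how mesh conditions map to body structure, and the smaller single-identity pass (typically at a lower learning rate) ties that structure to one person's appearance.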
➤ Batch-overlapped Temporal Denoising: We run inference video-wise to obtain consistent videos of arbitrary length. The 2D U-Net is extended to a 3D video-diffusion U-Net via a training-free all-frame cross-attention module, and a simple, effective algorithm operates on multi-batch noise to generate anchor videos of arbitrary length (see the sketches at the end of this transcription).
➤ Identity-Specific Face Enhancement: We further revise and enhance the face region with an inpainting-based approach using crop-and-blend operations (also sketched at the end of this transcription).

Results
Quantitative results: [User-study table comparing TPS, DreamPose, DisCo, and Ours on appearance, temporal consistency, and structure preservation; the scores are not fully legible.] Our method achieves better performance on image quality, temporal consistency, and structure preservation. Rating scores are on a scale from one to five, where five is the highest score and one is the lowest.
Qualitative results compared with other methods: Our method achieves accurate gestures and high-quality generation with facial details.
Cross-person motion results (left: pose; right: output): When the cross-person motion is in a similar style, the generated results are of good quality across the whole generated body.
Full-body results. Audio-driven results.
GitHub
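The "training-free all-frame cross-attention" can be pictured as reusing a 2D self-attention layer while letting keys and values come from every frame in the batch, so frames exchange appearance information without any new trained weights. The sketch below is an assumption about the general shape of such a mechanism (the function name, tensor shapes, and projections are all hypothetical), not the poster's exact module.

```python
import torch
import torch.nn as nn

def all_frame_attention(q_proj, k_proj, v_proj, x):
    """x: (frames, tokens, dim). Queries stay per-frame; keys/values are
    gathered from all frames so every frame attends to every other one."""
    f, t, d = x.shape
    q = q_proj(x)                          # (f, t, d) per-frame queries
    kv = x.reshape(1, f * t, d)            # tokens pooled from all frames
    k, v = k_proj(kv), v_proj(kv)          # (1, f*t, d), broadcast below
    attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn @ v                        # (f, t, d)

# Toy usage with random projections and latents (8 frames, 64 tokens).
d = 32
out = all_frame_attention(nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d),
                          torch.randn(8, 64, d))
```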
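For batch-overlapped temporal denoising, a common way to realize "an algorithm operated on multi-batch noise" is to denoise overlapping windows of the latent video at every step and average the predictions where windows overlap, so adjacent batches agree at their seams. This is a sketch under that assumption: denoise_step is a hypothetical per-window denoiser, and the window/stride values are arbitrary.

```python
import torch

def batch_overlapped_denoise(denoise_step, init_noise, num_steps,
                             window=16, stride=8):
    """Denoise an arbitrary-length latent video (frames, C, H, W) with
    overlapping windows, averaging predictions in overlapped regions."""
    latents = init_noise
    n = latents.shape[0]
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)          # make sure the tail is covered
    for t in reversed(range(num_steps)):
        acc = torch.zeros_like(latents)
        cnt = torch.zeros(n, 1, 1, 1)
        for s in starts:
            acc[s:s + window] += denoise_step(latents[s:s + window], t)
            cnt[s:s + window] += 1
        latents = acc / cnt                # average where windows overlap
    return latents

# Toy usage: a "denoiser" that just shrinks latents toward zero.
video = batch_overlapped_denoise(lambda z, t: 0.9 * z,
                                 torch.randn(40, 4, 8, 8), num_steps=5)
```

Because every window shares frames with its neighbours, the averaged updates keep the batches consistent, which is what lets the video grow to arbitrary length without visible seams.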
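Finally, the inpainting-based crop-and-blend face enhancement can be sketched as: crop the face box, regenerate it with an identity-specific model, and blend the result back under a soft mask so the seam is invisible. In the sketch, inpaint_face is a hypothetical model call and the Hann-window mask is just one plausible blending choice, not the poster's stated method.

```python
import torch

def enhance_face(frame, face_box, inpaint_face):
    """Crop the face region of frame (C, H, W), enhance it, and blend it
    back with a soft mask (1 at the crop centre, 0 at its edges)."""
    x0, y0, x1, y1 = face_box
    crop = frame[:, y0:y1, x0:x1]
    enhanced = inpaint_face(crop)          # hypothetical inpainting model
    h, w = y1 - y0, x1 - x0
    mask = (torch.hann_window(h).view(h, 1) *
            torch.hann_window(w).view(1, w)).unsqueeze(0)
    out = frame.clone()
    out[:, y0:y1, x0:x1] = mask * enhanced + (1 - mask) * crop
    return out

# Toy usage with a stand-in "enhancer" that only clamps values.
frame = torch.rand(3, 128, 128)
result = enhance_face(frame, (40, 20, 90, 80), lambda c: c.clamp(0, 1))
```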