Researchers present their work on "Correcting Diffusion Generation through Resampling" at a conference. Their poster, prominently featuring the MIT logo, explores methods for improving pre-trained diffusion models post hoc by addressing the distribution gap between generated and real images. Key components include a particle filtering framework and text-to-image generation techniques. The poster details sample-selection methodologies, the problem formulation, and experimental results with illustrated charts and images. The CVPR 2024 conference logo indicates the event and its location in Seattle, WA. Attendees are engaged in a detailed discussion of the research findings, highlighting the work's significance in the field.

Text transcribed from the image:

Correcting Diffusion Generation through Resampling
Yujian Liu¹, Yang Zhang², Tommi Jaakkola³, Shiyu Chang¹
¹UC Santa Barbara, ²MIT-IBM Watson AI Lab, ³Massachusetts Institute of Technology
MIT-IBM Watson AI Lab | CVPR, June 17-21, 2024, Seattle, WA | Paper | Code

➤ Open Question
How to improve pre-trained diffusion models post hoc with marginal additional compute?

➤ Distribution Gap: Generated vs. Real Images
• Missing object errors: "Cat on chair peering over top of table at glass of beverage."
• Low image quality: "Two donuts, banana, cup and a book on the table."

➤ Initial Exploration: Naive Sample Selection
ObjectSelect algorithm:
• Step 1: Generate K images from the diffusion model.
• Step 2: Select the one with the best object occurrence.
Observation: ObjectSelect is competitive in reducing missing-object errors.
Indication: Sampling over multiple generation paths is effective for modifying the diffusion generation distribution.

➤ A Particle Filtering Framework
Q: Can we design a more effective sampling algorithm that addresses both errors?
[Diagram: K particles {x_t^{(k)}} alternate between a denoise step and a resample step with weights φ_t / φ_{t+1}.]

Problem Formulation
Diffusion denoising process: q(x_{0:T} | C) = q(x_T) ∏_{t=0}^{T-1} q(x_t | x_{t+1}, C)
Ground-truth distribution: p(x_0 | C)
Goal: reduce the gap between q(x_0 | C) and p(x_0 | C) with two types of external guidance:
• an object detector
• a small set of real images

At each denoising step t:
• Sample x_t^{(k)} ~ q(x_t | x_{t+1} = x_{t+1}^{(k)}, C) for k = 1, …, K.
• Resample from the set {x_t^{(k)}}_{k=1}^{K} K times (with replacement), with probability proportional to φ_t(x_t^{(k)} | C) / φ_{t+1}(x_{t+1}^{(k)} | C).
Result: the final particles {x_0^{(k)}} follow the distribution q(x_0 | C) · φ_0(x_0 | C).
Indication: setting φ_t(x_t | C) = p(x_t | C) / q(x_t | C), the sampled distribution can approach p(x_0 | C).
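To make the sample-then-resample loop above concrete, here is a minimal sketch in Python/NumPy. It is an illustration under stated assumptions, not the authors' released implementation: `denoise_step` and `weight_fn` are hypothetical callables standing in for one reverse-diffusion step q(x_t | x_{t+1}, C) and the particle weight φ_t(x_t | C), respectively.

```python
import numpy as np

def particle_filtering_sample(denoise_step, weight_fn, x_T, num_steps, rng=None):
    """Sketch of the poster's sample-then-resample loop (not the official code).

    denoise_step(particles, t): hypothetical callable applying one reverse-diffusion
        step q(x_t | x_{t+1}, C) to an array of K particles.
    weight_fn(particles, t): hypothetical callable returning per-particle weights
        phi_t(x_t | C); the poster suggests phi_t ~ p(x_t | C) / q(x_t | C).
    x_T: array of shape (K, ...) holding the K initial noise particles.
    """
    rng = rng or np.random.default_rng()
    particles = np.asarray(x_T)
    K = len(particles)
    prev_w = weight_fn(particles, num_steps)                # phi_{t+1} at t+1 = T
    for t in reversed(range(num_steps)):
        particles = denoise_step(particles, t)              # x_t^{(k)} ~ q(x_t | x_{t+1}^{(k)}, C)
        w = weight_fn(particles, t)                         # phi_t(x_t^{(k)} | C)
        incr = w / np.clip(prev_w, 1e-12, None)             # incremental weight phi_t / phi_{t+1}
        probs = incr / incr.sum()
        idx = rng.choice(K, size=K, replace=True, p=probs)  # resample K particles with replacement
        particles, prev_w = particles[idx], w[idx]
    return particles                                        # approx. samples from q(x_0 | C) * phi_0(x_0 | C)
```

The resampling with replacement uses the incremental weight φ_t / φ_{t+1}, matching the poster's diagram; any concrete weight design (e.g., the discriminator-based estimate described next) can be plugged in as `weight_fn`.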
A discriminator-based approach:
φ_t(x_t | C) = p(x_t | C) / q(x_t | C) ≈ d(x_t | C; t) / (1 − d(x_t | C; t)),
where d is a discriminator trained with a small set of real images.

Notation: object mention O_C, with O_C[i] = 1 if object i is mentioned in C and 0 otherwise.

A hybrid approach (see the sketch after the references):
p(x_t | C) / q(x_t | C) = p(x_t | C, O_C) / q(x_t | C, O_C)
= [q(C, O_C) / p(C, O_C)] · [p(x_t) / q(x_t)] · [p(O_C | x_t) / q(O_C | x_t)] · [p(C | O_C, x_t) / q(C | O_C, x_t)],
where the object detector estimates p(O_C | x_t) / q(O_C | x_t) and the discriminator estimates the remaining x_t-dependent ratios; the factor q(C, O_C) / p(C, O_C) does not depend on x_t and is irrelevant after the resampling probabilities are normalized.

➤ Text-to-Image Generation
[Figure 1: FID (↓) vs. object occurrence (↑) on GPT-Synthetic and MS-COCO. Methods compared: Plain SD, Spatial-Temporal, D-Guidance, Attend-Excite, TIFA Select, Reward Select, ObjectSelect, PF-Discriminator, PF-Hybrid. FID is measured on MS-COCO for both figures. K = 5, 10, 15 images are generated for the sample selection methods, and the sizes of points indicate the value of K.]
[Figure 2: Sample images by our method.]
[Table 1: Ablation study on the effects of the particle filtering algorithm and the particle weight design, reporting object occurrence (%, ↑) on GPT-Syn and MS-COCO and FID (↓).]

➤ Unconditional & Class-conditioned Generation
[Figure 3: FID (↓) on ImageNet-64 and FFHQ when evaluated with the Restart sampler, comparing Original Sampler, D-Guidance, D-Select, and Particle Filtering. The x-axis indicates the effective NFE, which considers all compute costs, including the forward and backward passes of the discriminator.]
[Figure 4: Ablation on ImageNet-64: FID (↓) for D-Select and our PF method when different NFEs are used for each image.]

[1] Dongjun Kim et al. "Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models." 2023.
[2] Shyamgopal Karthik et al. "If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection." 2023.
[3] Hila Chefer et al. "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models." 2023.
[4] Qiucheng Wu et al. "Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis." 2023.

Contact: {yujianliu, chang87}@ucsb.edu, yang.zhang2@ibm.com, tommi@csail.mit.edu
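As a hedged illustration of the weight designs transcribed above, the following sketch shows the discriminator-based density-ratio estimate φ_t ≈ d / (1 − d) and a hybrid weight that multiplies in an object-detector term. The poster does not spell out implementation details, so `detector_ratio` (the detector's estimate of p(O_C | x_t) / q(O_C | x_t)) and all names here are assumptions for illustration only.

```python
import numpy as np

def discriminator_weight(d_out):
    """Density-ratio estimate phi_t(x_t | C) ~ d(x_t | C; t) / (1 - d(x_t | C; t)),
    where d_out is the output of a time-conditioned discriminator in (0, 1)."""
    d_out = np.clip(np.asarray(d_out), 1e-6, 1.0 - 1e-6)
    return d_out / (1.0 - d_out)

def hybrid_weight(d_out, detector_ratio):
    """Hybrid weight sketch: the discriminator covers p(x_t)/q(x_t) and
    p(C | O_C, x_t)/q(C | O_C, x_t), while an external object detector supplies
    detector_ratio ~ p(O_C | x_t) / q(O_C | x_t) for the mentioned objects.
    (detector_ratio is an assumed, precomputed input in this sketch.)"""
    return discriminator_weight(d_out) * np.asarray(detector_ratio)
```

Either function could be wrapped as the `weight_fn` passed to the resampling loop sketched earlier in this transcription.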