This image captures a research presentation at the CVPR 2024 conference in Seattle, WA. The poster is titled "Correcting Diffusion Generation through Resampling" and is presented by researchers from UC Santa Barbara, the MIT-IBM Watson AI Lab, and the Massachusetts Institute of Technology. Marked as a "Highlight" presentation, the poster explores post-hoc methods for improving pre-trained diffusion models through resampling. Its content includes sections on an initial exploration with naive sample selection, a particle filtering framework, and results for both text-to-image generation and unconditional/class-conditioned generation. The researchers focus on closing the distribution gap between generated and real images by selecting among multiple generation paths. Graphs, figures, and equations support the findings, and the poster closes with references and contact information. Two individuals are closely examining the poster, engaging with the presented material. Text transcribed from the image:

MIT-IBM Watson AI Lab · CVPR, June 17-21, 2024, Seattle, WA · Highlight · [QR codes: Paper, Code]

Correcting Diffusion Generation through Resampling
Yujian Liu¹, Yang Zhang², Tommi Jaakkola³, Shiyu Chang¹
¹UC Santa Barbara, ²MIT-IBM Watson AI Lab, ³Massachusetts Institute of Technology

➤ Open Question
How to improve pre-trained diffusion models post hoc with marginal additional compute?

➤ Distribution Gap: Generated vs. Real Images
• Missing object errors: "Cat on chair peering over top of table at glass of beverage."
• Low image quality: "Two donuts, banana, cup and a book on the table."

➤ Initial Exploration: Naive Sample Selection
ObjectSelect algorithm
Step 1: Generate K images from the diffusion model.
Step 2: Select the one with the best object occurrence.
Observation: ObjectSelect is competitive in reducing missing object errors.
Indication: Sampling over multiple generation paths is effective at modifying the diffusion generation distribution.

➤ A Particle Filtering Framework
Q: Can we design a more effective sampling algorithm to address both errors?
[Diagram: at each denoise step, the K particles {x_t^{(k)}} are resampled with weights φ_t/φ_{t+1}.]

Problem Formulation
Diffusion denoising process: q(x_{0:T} | C) = q(x_T) ∏_{t=0}^{T-1} q(x_t | x_{t+1}, C)
Ground-truth distribution: p(X_0 | C)
Goal: reduce the gap between q(X_0 | C) and p(X_0 | C) with two types of external guidance:
• an object detector
• a small set of real images

At each denoise step, with K particles:
Sample: x_t^{(k)} ~ q(X_t | X_{t+1} = x_{t+1}^{(k)}, C)
Resample: draw from the set {x_t^{(k)}}_{k=1}^{K} K times (with replacement) with probability proportional to φ_t(x_t^{(k)} | C) / φ_{t+1}(x_{t+1}^{(k)} | C)
Result: {x_t^{(k)}} follow the distribution q(X_t | C) φ_t(X_t | C)
Indication: setting φ_t(X_t | C) = p(X_t | C) / q(X_t | C), q(X_t | C) φ_t(X_t | C) can approach p(X_t | C)

A discriminator-based approach (using a small set of real images):
p(X_t | C) / q(X_t | C) ≈ d(X_t | C; t) / (1 − d(X_t | C; t))

Notation: object mention O_C, where O_C[i] = 1 if object i is mentioned in C, and 0 otherwise.

A hybrid approach:
p(X_t | C) / q(X_t | C) = [p(X_t, C, O_C) / q(X_t, C, O_C)] · [q(C, O_C) / p(C, O_C)]
= [q(C, O_C) / p(C, O_C)] · [p(X_t) / q(X_t)] · [p(O_C | X_t) / q(O_C | X_t)] · [p(C | O_C, X_t) / q(C | O_C, X_t)]
where p(X_t) / q(X_t) is estimated by the discriminator and p(O_C | X_t) / q(O_C | X_t) by the object detector.
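The resampling scheme above lends itself to a compact implementation. Below is a minimal sketch assuming a pre-trained sampler exposed as a single `denoise_step(x, t, C)` and a time-conditioned `discriminator(x, t, C)` returning d(X_t | C; t) in (0, 1); both names are hypothetical placeholders rather than the authors' released API, and the discriminator-based weight is just one of the two options on the poster.

```python
# Minimal sketch of the poster's particle filtering resampling (illustrative,
# not the authors' released code). `denoise_step` and `discriminator` are
# hypothetical placeholders for a pre-trained diffusion sampler step and a
# discriminator d(x_t | C; t) trained on a small set of real images.
import numpy as np

def particle_filter_sample(denoise_step, discriminator, x_T, C, T, K, rng=None):
    """Maintain K denoising paths; at each step, resample with weights
    proportional to phi_t(x_t | C) / phi_{t+1}(x_{t+1} | C), where
    phi_t(x | C) = d(x | C; t) / (1 - d(x | C; t)) approximates p/q."""
    rng = rng or np.random.default_rng()
    particles = [x_T.copy() for _ in range(K)]   # K particles at t = T
    prev_phi = np.ones(K)                        # phi at the previous (noisier) step
    for t in reversed(range(T)):                 # t = T-1, ..., 0
        # Sample: x_t^(k) ~ q(X_t | X_{t+1} = x_{t+1}^(k), C)
        particles = [denoise_step(x, t, C) for x in particles]
        # Weight: density ratio p/q approximated by the discriminator, d/(1-d)
        d = np.array([discriminator(x, t, C) for x in particles])
        phi = d / np.clip(1.0 - d, 1e-8, None)
        w = phi / prev_phi
        w /= w.sum()
        # Resample: K draws with replacement, proportional to w
        idx = rng.choice(K, size=K, replace=True, p=w)
        particles = [particles[i] for i in idx]
        prev_phi = phi[idx]
    return particles  # approximately follow q(X_0 | C) * phi_0(X_0 | C)
```

Collapsing the procedure to a single selection at t = 0, with an object-occurrence score in place of the density ratio, essentially recovers the ObjectSelect baseline from the initial exploration.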
➤ Text-to-Image Generation
[Two scatter plots, GPT-Synthetic and MS-COCO, plotting FID against object occurrence for Plain SD, Spatial-Temporal, D-Guidance, Attend-Excite, TIFA Select, Reward Select, ObjectSelect, PF-Discriminator, and PF-Hybrid.]
Figure 1: FID (↓) vs. object occurrence (↑). FID is measured on MS-COCO for both figures. K = 5, 10, 15 images are generated for sample selection methods, and the sizes of points indicate the value of K.
Figure 2: Sample images by our method.

Table 1: Ablation study on effects of the particle filtering algorithm and particle weights design (method labels are not legible in the image).

              Object Occurrence (%) ↑      FID ↓
Method        GPT-Syn     MS-COCO
[illegible]   72.96       83.84            24.03
[illegible]   67.16       80.49            24.18
[illegible]   75.67       85.79            25.77

➤ Unconditional & Class-Conditioned Generation
[Two line plots, ImageNet-64 and FFHQ, showing FID versus effective NFE for the Original Sampler, D-Guidance, D-Select, and Particle Filtering.]
Figure 3: FID (↓) when evaluated with the Restart sampler. The x-axis indicates the effective NFE, which considers all compute costs, including the forward and backward passes of the discriminator.
[One line plot, ImageNet-64, comparing D-Select and Particle Filtering.]
Figure 4: Ablation: FID (↓) for D-Select and our PF method when different NFEs are used for each image.

[1] Dongjun Kim et al. "Refining Generative Process with Discriminator Guidance in Score-Based Diffusion Models." 2023.
[2] Shyamgopal Karthik et al. "If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-Based Text-to-Image Generation by Selection." 2023.
[3] Hila Chefer et al. "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models." 2023.
[4] Qiucheng Wu et al. "Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis." 2023.

Contact: {yujianliu, chang87}@ucsb.edu, yang.zhang2@ibm.com, tommi@csail.mit.edu