Researchers present their work on "Correcting Diffusion Generation through Resampling" at a conference. Their poster, prominently featuring the MIT logo, explores methods for improving pre-trained diffusion models post hoc by addressing the distribution gap between generated and real images. Key components include a particle filtering framework and text-to-image generation techniques. The poster details sample-selection methodologies, the problem formulation, and experimental results with illustrated charts and images. The CVPR 2024 conference logo indicates the event and its location in Seattle, WA. Attendees are engaged in a detailed discussion of the research findings, highlighting the work's significance in the field.

Text transcribed from the image:

Correcting Diffusion Generation through Resampling
Yujian Liu¹, Yang Zhang², Tommi Jaakkola³, Shiyu Chang¹
¹UC Santa Barbara, ²MIT-IBM Watson AI Lab, ³Massachusetts Institute of Technology
MIT-IBM Watson AI Lab | CVPR, June 17-21, 2024, Seattle, WA | Paper | Code

➤ Open Question
How to improve pre-trained diffusion models post hoc with marginal additional compute?

➤ Distribution Gap: Generated vs. Real Images
• Missing object errors: "Cat on chair peering over top of table at glass of beverage."
• Low image quality: "Two donuts, banana, cup and a book on the table."

➤ Initial Exploration: Naive Sample Selection
ObjectSelect algorithm:
• Step 1: Generate K images from the diffusion model.
• Step 2: Select the one with the best object occurrence.
Observation: ObjectSelect is competitive in reducing missing-object errors.
Indication: Sampling over multiple generation paths is effective for modifying the diffusion generation distribution.

➤ A Particle Filtering Framework
Q: Can we design a more effective sampling algorithm that addresses both errors?
[Diagram: K particles {x_t^{(k)}} alternate between a denoise step and a resample step with weights φ_t / φ_{t+1}.]

Problem Formulation
Diffusion denoising process: q(x_{0:T} | C) = q(x_T) ∏_{t=0}^{T-1} q(x_t | x_{t+1}, C)
Ground-truth distribution: p(x_0 | C)
Goal: reduce the gap between q(x_0 | C) and p(x_0 | C) with two types of external guidance:
• an object detector
• a small set of real images

At each denoising step t:
• Sample x_t^{(k)} ~ q(x_t | x_{t+1} = x_{t+1}^{(k)}, C) for k = 1, …, K.
• Resample from the set {x_t^{(k)}}_{k=1}^{K} K times (with replacement), with probability proportional to φ_t(x_t^{(k)} | C) / φ_{t+1}(x_{t+1}^{(k)} | C).
Result: the final particles {x_0^{(k)}} follow the distribution q(x_0 | C) · φ_0(x_0 | C).
Indication: setting φ_t(x_t | C) = p(x_t | C) / q(x_t | C), the sampled distribution can approach p(x_0 | C).
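To make the sample-then-resample loop above concrete, here is a minimal sketch in Python/NumPy. It is an illustration under stated assumptions, not the authors' released implementation: `denoise_step` and `weight_fn` are hypothetical callables standing in for one reverse-diffusion step q(x_t | x_{t+1}, C) and the particle weight φ_t(x_t | C), respectively.

```python
import numpy as np

def particle_filtering_sample(denoise_step, weight_fn, x_T, num_steps, rng=None):
    """Sketch of the poster's sample-then-resample loop (not the official code).

    denoise_step(particles, t): hypothetical callable applying one reverse-diffusion
        step q(x_t | x_{t+1}, C) to an array of K particles.
    weight_fn(particles, t): hypothetical callable returning per-particle weights
        phi_t(x_t | C); the poster suggests phi_t ~ p(x_t | C) / q(x_t | C).
    x_T: array of shape (K, ...) holding the K initial noise particles.
    """
    rng = rng or np.random.default_rng()
    particles = np.asarray(x_T)
    K = len(particles)
    prev_w = weight_fn(particles, num_steps)                # phi_{t+1} at t+1 = T
    for t in reversed(range(num_steps)):
        particles = denoise_step(particles, t)              # x_t^{(k)} ~ q(x_t | x_{t+1}^{(k)}, C)
        w = weight_fn(particles, t)                         # phi_t(x_t^{(k)} | C)
        incr = w / np.clip(prev_w, 1e-12, None)             # incremental weight phi_t / phi_{t+1}
        probs = incr / incr.sum()
        idx = rng.choice(K, size=K, replace=True, p=probs)  # resample K particles with replacement
        particles, prev_w = particles[idx], w[idx]
    return particles                                        # approx. samples from q(x_0 | C) * phi_0(x_0 | C)
```

The resampling with replacement uses the incremental weight φ_t / φ_{t+1}, matching the poster's diagram; any concrete weight design (e.g., the discriminator-based estimate described next) can be plugged in as `weight_fn`.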
A discriminator-based approach:
φ_t(x_t | C) = p(x_t | C) / q(x_t | C) ≈ d(x_t | C; t) / (1 − d(x_t | C; t)),
where d is a discriminator trained with a small set of real images.

Notation: object mention O_C, with O_C[i] = 1 if object i is mentioned in C and 0 otherwise.

A hybrid approach (see the sketch after the references):
p(x_t | C) / q(x_t | C) = p(x_t | C, O_C) / q(x_t | C, O_C)
= [q(C, O_C) / p(C, O_C)] · [p(x_t) / q(x_t)] · [p(O_C | x_t) / q(O_C | x_t)] · [p(C | O_C, x_t) / q(C | O_C, x_t)],
where the object detector estimates p(O_C | x_t) / q(O_C | x_t) and the discriminator estimates the remaining x_t-dependent ratios; the factor q(C, O_C) / p(C, O_C) does not depend on x_t and is irrelevant after the resampling probabilities are normalized.

➤ Text-to-Image Generation
[Figure 1: FID (↓) vs. object occurrence (↑) on GPT-Synthetic and MS-COCO. Methods compared: Plain SD, Spatial-Temporal, D-Guidance, Attend-Excite, TIFA Select, Reward Select, ObjectSelect, PF-Discriminator, PF-Hybrid. FID is measured on MS-COCO for both figures. K = 5, 10, 15 images are generated for the sample selection methods, and the sizes of points indicate the value of K.]
[Figure 2: Sample images by our method.]
[Table 1: Ablation study on the effects of the particle filtering algorithm and the particle weight design, reporting object occurrence (%, ↑) on GPT-Syn and MS-COCO and FID (↓).]

➤ Unconditional & Class-conditioned Generation
[Figure 3: FID (↓) on ImageNet-64 and FFHQ when evaluated with the Restart sampler, comparing Original Sampler, D-Guidance, D-Select, and Particle Filtering. The x-axis indicates the effective NFE, which considers all compute costs, including the forward and backward passes of the discriminator.]
[Figure 4: Ablation on ImageNet-64: FID (↓) for D-Select and our PF method when different NFEs are used for each image.]

[1] Dongjun Kim et al. "Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models." 2023.
[2] Shyamgopal Karthik et al. "If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection." 2023.
[3] Hila Chefer et al. "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models." 2023.
[4] Qiucheng Wu et al. "Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis." 2023.

Contact: {yujianliu, chang87}@ucsb.edu, yang.zhang2@ibm.com, tommi@csail.mit.edu
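As a hedged illustration of the weight designs transcribed above, the following sketch shows the discriminator-based density-ratio estimate φ_t ≈ d / (1 − d) and a hybrid weight that multiplies in an object-detector term. The poster does not spell out implementation details, so `detector_ratio` (the detector's estimate of p(O_C | x_t) / q(O_C | x_t)) and all names here are assumptions for illustration only.

```python
import numpy as np

def discriminator_weight(d_out):
    """Density-ratio estimate phi_t(x_t | C) ~ d(x_t | C; t) / (1 - d(x_t | C; t)),
    where d_out is the output of a time-conditioned discriminator in (0, 1)."""
    d_out = np.clip(np.asarray(d_out), 1e-6, 1.0 - 1e-6)
    return d_out / (1.0 - d_out)

def hybrid_weight(d_out, detector_ratio):
    """Hybrid weight sketch: the discriminator covers p(x_t)/q(x_t) and
    p(C | O_C, x_t)/q(C | O_C, x_t), while an external object detector supplies
    detector_ratio ~ p(O_C | x_t) / q(O_C | x_t) for the mentioned objects.
    (detector_ratio is an assumed, precomputed input in this sketch.)"""
    return discriminator_weight(d_out) * np.asarray(detector_ratio)
```

Either function could be wrapped as the `weight_fn` passed to the resampling loop sketched earlier in this transcription.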