In the image, a large, multi-piece poster is displayed on the wall, showcasing an array of complex images and diagrams. The poster conveys information about scientific research, likely in computer vision. Rows of lights illuminate the details of the pictures on the wall. The presentation appears to target students, as it is shown at an "In Class" meeting and features diagrams and pictures. Text transcribed from the image:

PEKING UNIVERSITY (est. 1898) · 地平线 Horizon Robotics · CVPR
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
Qin Guo, Tianwei Lin
Video: https://youtu.be/rPknqOJsxkg

Abstract
➤ Task: Fine-grained and multi-instruction image editing.
➤ Problem: As seen in the figures below, InstructPix2Pix (IP2P), a typical instruction-based image editing method, tends to over-edit and degrades severely when editing with multiple instructions. This prevents more fine-grained and more complex image editing.
[Figures: input images with instructions such as "Give him a black elegant hat" and "What if she were in a ... pair of ...", comparing the vanilla IP2P sample, instruction concatenation, and the disentangle sample.]

Experimental Results
[Figures: cross-attention-based mask extraction and modulation process (Sec. 4.1); illustration of the disentangle sample (Sec. 4.3); IP2P vs. IP2P + FoI; cross-attention maps obtained from IP2P with increasing denoising steps.]

➤ When a person wants to edit an image: the edit area is found first, and then fine-grained edits are performed. At the same time, when multiple edits coexist, people can implicitly ensure the harmony of multi-instruction edits.
Our solution
➤ Mimicking the human process of editing an image, we use the hidden grounding capability of IP2P (shown in Fig. 3) to first find the editing area, and then restrict the editing effect to the corresponding region by cross-attention modulation and the disentangle sample.
[Fig. 3: cross-attention maps obtained from IP2P for "clock", "flower", "Eiffel", "grass", "pray"; attention maps before and after attention modulation over increasing denoising steps; framework of FoI.]
➤ We inspect the cross-attention maps of InstructPix2Pix at the first inference step and find that they provide strong grounding for nouns, adjectives, and verbs.
➤ First, we find the edit area in the first denoising step:

    A'[e] = norm(norm(··· norm(A_1[e]^2)^2 ···)^2)    (2)

where norm(·) is applied γ times.
➤ Then, we restrict the editing effect to the corresponding region by cross-attention modulation:

    Ã_{t,ins} = softmax(((X + ΔX) ⊙ M + Y ⊙ (1 − M)) / √d)    (3)

Here, d represents the latent projection dimension.
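As a sanity check on the steps above, the mask extraction and cross-attention modulation can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names, the γ (`gamma`), the binarization threshold, the min-max form of norm(·), and the constant α are all assumptions made for the example.

```python
import numpy as np

def extract_edit_mask(attn, gamma=3, threshold=0.5):
    # Eq. (2) sketch: repeatedly square and renormalize the first-step
    # cross-attention map A_1[e]; squaring suppresses weak responses while
    # renormalization keeps the strongest response at 1, so the map sharpens
    # toward an edit-region mask. gamma/threshold are illustrative choices.
    a = np.asarray(attn, dtype=float)
    for _ in range(gamma):
        a = a ** 2
        a = (a - a.min()) / (a.max() - a.min() + 1e-8)  # norm(.) as min-max
    return (a > threshold).astype(float)  # binary mask M

def modulated_attention(Q, K_ins, K_null, M, alpha=0.5):
    # Eqs. (3)-(6) sketch: inside the mask M use instruction scores X
    # amplified by dX; outside, fall back to null-instruction scores Y;
    # then apply the scaled softmax of Eq. (3).
    d = Q.shape[-1]
    X = Q @ K_ins.T                   # (4) instruction attention scores
    Y = Q @ K_null.T                  # (5) null-instruction scores
    dX = alpha * (X.max() - X)        # (6) push scores toward the maximum
    S = ((X + dX) * M[:, None] + Y * (1.0 - M[:, None])) / np.sqrt(d)
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Here `M` is a per-spatial-location mask broadcast over text tokens; in the method it would come from the mask extracted in the first denoising step.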
The terms are defined as follows:

    X = Q_I K_ins^T    (4)
    Y = Q_I K_∅^T    (5)
    ΔX = α(t) · (max(Q_I K_ins^T) − Q_I K_ins^T)    (6)

➤ Finally, we use the disentangle sample, at the latent level, to limit the total editing effect to the total editing area:

    ε̃_θ(z_t, t, I, T) = ε_θ(z_t, t, ∅_I, ∅_T)
        + s_I · (ε_θ(z_t, t, I, ∅_T) − ε_θ(z_t, t, ∅_I, ∅_T))
        + s_T · (ε_θ(z_t, t, I, T) − ε_θ(z_t, t, I, ∅_T)) ⊙ M_union    (9)

Human preference study
[Table: Instruction Align and Image Align preference rates for single- and multi-instruction editing, comparing DiffEdit [10], NTI-P2P [32], IP2P [7], MagicBrush [65], InDiff [14], and FoI (ours); FoI obtains the highest rates, e.g., 23.17% and 27.5% in the single-instruction setting.]

Quantitative comparisons
[Tables: CLIP-based metrics and PickScore for single- and multi-instruction editing against DiffEdit, NTI-P2P, MagicBrush, and InDiff.]
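Eq. (9) has the shape of IP2P's two-branch classifier-free guidance, with the instruction-guidance term masked by M_union. A minimal sketch, assuming a generic noise predictor `eps` in place of ε_θ and illustrative guidance scales:

```python
import numpy as np

def disentangle_sample(eps, z_t, t, I, T, null_I, null_T, M_union,
                       s_I=1.5, s_T=7.5):
    # Eq. (9) sketch: compose the unconditional, image-conditioned, and
    # fully conditioned noise predictions, multiplying the instruction
    # (text) guidance term by M_union so instruction effects are confined,
    # at the latent level, to the union of the extracted edit areas.
    # `eps` stands in for eps_theta; s_I, s_T are illustrative defaults.
    e_uncond = eps(z_t, t, null_I, null_T)   # eps(z_t, t, null_I, null_T)
    e_img    = eps(z_t, t, I, null_T)        # image-conditioned only
    e_full   = eps(z_t, t, I, T)             # image + instruction
    return (e_uncond
            + s_I * (e_img - e_uncond)
            + s_T * (e_full - e_img) * M_union)
```

Outside M_union the instruction term vanishes and the prediction reduces to the image-conditioned branch, which is what keeps unedited regions intact.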