A researcher engages in a discussion at a conference, standing in front of a detailed poster titled "ViewDiff: 3D-Consistent Image Generation." The poster, from the Technical University of Munich and Meta, presents a method for generating high-quality, multi-view consistent images from text descriptions. Diagrams and sample images of objects such as a teddy bear and a fire hydrant illustrate the results. The attendee wears a light green shirt and an orange Voxel51 lanyard with a conference badge. The event is CVPR 2024 in Seattle, WA. Text transcribed from the image: CVPR, June 17-21, 2024, Seattle, WA. Introduction. ViewDiff: 3D-Consistent Image Generation w… Lukas Höllein¹,², Aljaž Božič², Norman Müller, David …, Christian Richardt, Michael Zollhöfer, Matthia… Work done during Lukas' internship. Technical University of Munich, Meta. Generated 3D-Consistent Images From Text: … produces high-quality, multi-view consistent images of a real-world … in authentic surroundings. Sample prompts: "a teddy bear sitting on the ground in the dark"; "a red and white fire hydrant on a brick floor"; "a glazed donut sitting on a marble counter"; "a red apple on a blue and white patterned …". Finetune With Multi-View Supervision: … pretrained text-to-image models into 3D-consistent image generators … finetuning them with multi-view supervision (CO3D dataset); augment the U-Net architecture with new layers in every U-Net block. Optimized 3D Object (N… Image Reconstruction: Real Image, Ours. Given a single … image and output poses … create multi-view consistent images … same object in a … denoising forward pass. Apple PSNR SSIM↑ 19.54 0.64 5.79 0.91 5.94 0.91. First, we replace self-attention with cross-frame attention (yellow), conditioned on pose (RT), intrinsics (K), and intensity (0). NeRF optimization from 100 gener… Second, we add a projection layer (green) into the inner
blocks of the U-Net. It creates a 3D representation from multi-view features and renders them into 3D-consistent features. CVPR, Seattle, WA. Lukas Hoellein, Technical University of Munich.