A large poster titled "PixelLM: Pixel Reasoning with Large Multimodal Model" is prominently displayed at a conference booth, numbered 215 and 216. The research was conducted by a team of researchers affiliated with WEI Lab @ BJTU and ByteDance, whose logos appear alongside the branding of CVPR 2024 (Conference on Computer Vision and Pattern Recognition), held in Seattle, WA, from June 17-21, 2024. The poster is divided into sections including "Introduction," "Method," "Dataset Detail," "Visualization," and "Experimental Results." The introduction presents the new model, PixelLM, for pixel-level reasoning and understanding. The method section includes a diagrammatic representation of the architecture, while the dataset detail section describes the components of the MUSE dataset. The visualization section exhibits images and result scenarios such as multi-target reasoning segmentation, instance-level segmentation, multi-referring segmentation, and conversational contexts, highlighting the model's broad range of applications. The experimental results comprise tables of benchmark scores showing the performance of PixelLM across different tests. A QR code for accessing the homepage of the research project is also present on the poster. In the foreground, a person is visible standing near the poster, contributing to the bustling atmosphere typical of academic conferences. The scene exemplifies an engaging and informative presentation setup, integral to such scholarly events.

Text transcribed from the image:

215 | 216. CVPR, Seattle, WA, June 17-21, 2024.

PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin†

Introduction
- We present PixelLM, a novel LMM for pixel-level reasoning and understanding, capable of handling multiple targets and diverse reasoning complexities without relying on costly segmentation models.
- We construct MUSE, a high-quality multi-target reasoning segmentation dataset. Utilizing a GPT-4V-aided data curation pipeline, it has 246k question-answer pairs, covering 0.9 million instances.
- PixelLM achieves new state-of-the-art results across a spectrum of benchmarks.

Dataset Detail
MUSE stands out with its open-set concepts, detailed object descriptions, complex multi-target question-answer pairs, and instance-level mask annotations.
Earphone: A pair of over-the-ear headphones rests next to the cat. Bed: A bed covered in a light blue quilt occupies the majority of the scene. Quilt: A crumpled light blue quilt almost completely covers the bed. Cat: A grey cat with a collar is lounging on a closed laptop. Laptop computer: A closed laptop is positioned towards the foot of the bed, under a resting cat.
Question: How can I comfortably listen to music while petting my cat when I get home from a long day at work?
Answer: You can lie down comfortably on the large bed covered with soft quilts. Then take the silver laptop out from under the chubby furry cat next to you and connect it to the black wired headphones next to you to listen to music.

Method
[Architecture diagram] PixelLM: a CLIP encoder produces image features f_img; text embeddings and a segmentation codebook ("Seg codebook") are fed into the Large Language Model; a light-weight decoder applies attention layers (Attn L_1 through Attn L-1) with fusion, updating f_img and combining the outputs via a weighted sum to produce masks. Modules in the diagram are marked as trainable or frozen. Example query: "How can I get to the sea far away from the coast?" Answer: "Sit in the kayak, propel it forward using the paddle."

Visualization
Visualization results in the following scenarios: 1. Multi-target reasoning segmentation; 2. Instance-level segmentation tied with text description; 3. Multi-referring segmentation; 4. Conversation.
Multi-target reasoning segmentation
USER: If someone wants to windsurf, what tools would they require from the image? ASSISTANT: A <SEG> that is essential for catching wind and a board <SEG> that is used for gliding over the water.
USER: Which items would a person use when he slides down the mountain while keeping his hands warm? ASSISTANT: He uses <SEG> to glide on the snow while wearing two <SEG>.
USER: How can I comfortably enjoy TV time? ASSISTANT: You can sit on the spacious sofa <SEG> and use the remote control <SEG> to easily navigate through channels or adjust the volume.
USER: I want to have a refreshing breakfast, how to do? ASSISTANT: You can drink a cup of coffee <SEG> and use the knife <SEG> to cut the sandwich <SEG>.

Instance-level segmentation tied with text description
USER: What objects are in the image? Please segment them separately. ASSISTANT: They are a brush <SEG>, a device near the brush, likely a flashlight <SEG>, a slim hairpin <SEG>, a blue paper tape <SEG>, and a pen <SEG>.

Multi-referring segmentation
USER: Segment the bed, mirror and bookshelf. ASSISTANT: The bed, mirror and bookshelf are <SEG>, <SEG> and <SEG>, separately.
USER: Segment each of the zebras. ASSISTANT: The zebra on the far left with only its head exposed is <SEG>, the zebra in the middle is <SEG>, and the zebra standing on the right is <SEG>.

Conversation
USER: What is the relationship of these two animals? ASSISTANT: The relationship between the calf and the cow in the image is that of a mother and her offspring. The cow is nursing her calf, which indicates that they share a strong bond and the cow is providing nourishment to her young one. USER: Segment them. ASSISTANT: The calf sucking milk and the cow providing milk to the calf are <SEG> and <SEG>.

Affiliation logos: Beijing Jiaotong University (WEI Lab @ BJTU), ByteDance, University of Science and Technology Beijing (鼎新, 1952).

Experimental Results (with a "Home page" QR code)
The above table displays results on the MUSE benchmark, while the table below shows results on the referring segmentation benchmark.
[Table: MUSE benchmark (val and test), reporting gIoU and cIoU on few-target, many-target, and overall splits, together with TFLOPs and whether SAM is used, for SEEM, LISA-7B, LISA-Llama2-13B, PixelLM-7B, and PixelLM-Llama2-13B.]
[Table: referring segmentation, reporting cIoU on refCOCO (val, testA, testB), refCOCO+ (val, testA, testB), and refCOCOg (val(U), test(U)) for MCN, VLT, CRIS, LAVT, ReLA, X-Decoder, SEEM, LISA, and PixelLM.]
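To make the poster's Method diagram easier to follow, here is a minimal sketch of the general idea it depicts: the LLM emits hidden states for segmentation-codebook tokens, and a light-weight decoder cross-attends those states over CLIP image features to score one mask per target. This is not the authors' implementation; the class name, dimensions, and the single cross-attention step are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class LightweightMaskDecoder(nn.Module):
    """Toy decoder (hypothetical): cross-attends seg-token states to image features and scores masks."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: queries are the seg tokens, keys/values are image patches.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, seg_tokens: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # seg_tokens: (B, N_seg, dim) hidden states of the <SEG> codebook tokens from the LLM
        # img_feats:  (B, H*W, dim)   CLIP patch features projected to `dim`
        fused, _ = self.attn(seg_tokens, img_feats, img_feats)
        queries = self.proj(fused)                                # (B, N_seg, dim)
        # Dot product of each query with every image location gives per-target mask logits.
        return torch.einsum("bnd,bld->bnl", queries, img_feats)   # (B, N_seg, H*W)


# Toy usage with random tensors standing in for real CLIP/LLM outputs.
decoder = LightweightMaskDecoder()
seg_tokens = torch.randn(1, 3, 256)        # e.g. three targets -> three <SEG> tokens
img_feats = torch.randn(1, 24 * 24, 256)   # 24x24 patch grid
mask_logits = decoder(seg_tokens, img_feats).reshape(1, 3, 24, 24)
```

The diagram on the poster suggests several attention layers over features that are repeatedly updated and then combined by a weighted sum; the sketch above collapses that into a single cross-attention and projection step.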