The image depicts a large poster board with specifications for a computer program. The poster board appears to have a lot of information and detail, with a cat sitting on it. There are several images of other things on the poster board, such as a dog, a house, and a car. The poster board is likely a visual representation of a computer program or technology product. Text transcribed from the image: 216 • • Introduction 2000 PixelLM: Pixel Reasoning with Large Multimodal Model Zhongwei Ren*, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin† We present PixelLM, a novel LMM for pixel-level reasoning and understanding, capable of handling multiple targets and diverse reasoning complexities without relying on costly segmentation models. We construct MUSE, a high-quality multi-target reasoning segmentation dataset. Utilizing a GPT-4V- aided data curation pipeline, it has 246k question- answer pairs, covering 0.9 million instances. Pixel LM achieves new state-of-the-art results across a spectrum of benchmarks. Dataset Detail MUSE stands out with its open-set concepts, detailed object descriptions, complex multi-target question- answer pairs, and instance-level mask annotations. Earphone: A pair of over-the-ear headphones rests next to the cat Bed: A bed covered in a light blue quilt occupies the majority of the scene Quilt: A crumpled light blue quilt almost completely covers the bed Cat: A grey cat with a collar is lounging on a closed laptop Laptop computer: A closed laptop is positioned towards the foot of the bed under a resting cat. Question: How can I comfortably listen to music while petting my cat when I get home from a long day at work? Answer: You can lie down comfortably on the large bed covered with soft quilts. he silver laptop out from under the chubby furry cat next to you ack wired headphones next to you to listen to music PixelLM Method Sit in the kayak, propel it forward using the paddle fimg CLIP encoder Large Language Model 0000-0000 Text embeddings How can I get to the sea far away from the coast? 00-0 Decoder Fusion f fimg L Light-weight decoder Attn L ---- Attn L-1-updatef XL-1 Weighted sum Trainable Frozen Seg codebook Visualization Visualization results in following scenarios: 1. Multi-target reasoning segmentation; 2. Instance-level segmentation tied with text description; 3. Multi-referring segmentation; 4. Conversation Multi-target reasoning segmentation USER: If someone wants to windsurf, what tools would they require from the image? ASSISTANT: A sail that is essential for catching wind and a board that are and . is used for gliding over the water USER: Which items would a person I use when he slides down the mountain while keeping his hand warm? ASSISTANT: He uses skis to glide on the snow while wearing two gloves . USER: How can I comfortably enjoy tv time? ASSISTANT: You can sit on the spacious sofa and use the remote control to easily navigate through channels or adjust the volume USER: I want to have a refreshing breakfast, how to do? ASSISTANT: You can drink a cup of coffee , use the knife SEG to eat the sandwich-SEG> Instance-level segmentation tied with text description USER: What objects are in the image! Please USER: Segment each of the zebras segment them separately. JADTONG CVPR SEATTLE, WA JUNE 17-21, 2024 WEI Lab@BJTU ByteDance Experimental Results Home page The above table displays results on the MUSE benchmark, while the table below shows results on the referring segmentation benchmark. 245 211 300 39 124 212 32 24 42445 22 ALL 4 235 2 06 2 47 42 Metod SAV LON shd dc ph dep de pe d SUM 10 18 2214 331 LISA x 131 LISA X 136 LISA- x PEM 10 PM- 7 1ST 426 427 4 ASSISTANT: The zebra on the far left with only its USALG 494 x 105 204 292223345 129 256 14434 LISA-L x 1834 PM-38 PixelM-Llama-138 645 42 57 44 61 33 542 COCO wib Method SAM val A test val test tes MCN (22) VLT 191 CRIS (33) LAVT [34] ReLA [16] ASSISTANT: They are A brush is SEG, a device near head exposed is SEG, the zebra in the middes the brush, likely a flashlight, is (SEG), a slim hairpin is SEG and the zebra standing on the rights SEG, a blue paper tape is SEO and a pen is SEG PixelLM Multi-referring segmentation Conversation USER: What the relationship of these two animal ASSISTANT: The relationship between the calf and the cow in the image is that of a mother and her offspring. The cow is nursing her call, which indicates that they share a strong bond and the cow is providing nourishment to her young one USER: Segment them. ASSISTANT: The calf sucking milk and the com providing milk to the calfare SEG and SEG USER: Segment the bed, mirror and book shelf ASSISTANT: The bed, mirror and book shelfare SEG, SEC and SEG, separately X-Decoder (42) SEEM 1431 624 642 59.7 506 550 447 492 494 67.5 70.5 652 563 610 501 550 57.7 70.5 73.2 661 623 681 537 59.9 604 72.7 75.8 688 621 684 551 612 621 646- 73.8 76.5 702 660 710 577 650 660 LISA [14] LISA PinellM 664 685 624 674 565 74.1 76.5 71 74.0 76.3 704 62.5 663 560 670 691 73.0 76.5 68.2 663 78.7 583 63 705