This image displays an academic poster titled "PixelLM: Pixel Reasoning with Large Multimodal Model" by Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. The poster presents PixelLM, a novel large multimodal model (LMM) for pixel-level reasoning and understanding, capable of handling multiple targets and diverse reasoning complexities without relying on costly segmentation models. The introduction highlights the construction of MUSE, a high-quality multi-target reasoning segmentation dataset built with a GPT-4V-aided data curation pipeline and containing 246k question-answer pairs covering 0.9 million instances, and notes that PixelLM achieves new state-of-the-art results across a spectrum of benchmarks. The Method section illustrates the PixelLM architecture, while the Dataset Detail section elaborates on MUSE's distinguishing aspects, such as open-set concepts and detailed object descriptions. The Visualization section shows results in scenarios such as multi-target reasoning segmentation and instance-level segmentation tied with text descriptions. Experimental results are summarized in tables covering the MUSE benchmark and the referring segmentation benchmark. The poster was displayed at CVPR (the Conference on Computer Vision and Pattern Recognition), as evidenced by the CVPR logo and location indicators at the top right corner, and lists affiliations with WEI Lab @ BJTU and ByteDance.

Text transcribed from the image:

PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren*, Zhicheng Huang*, Yunchao Wei*, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin†

Introduction
• We present PixelLM, a novel LMM for pixel-level reasoning and understanding, capable of handling multiple targets and diverse reasoning complexities without relying on costly segmentation models.
• We construct MUSE, a high-quality multi-target reasoning segmentation dataset. Utilizing a GPT-4V-aided data curation pipeline, it has 246k question-answer pairs, covering 0.9 million instances.
• PixelLM achieves new state-of-the-art results across a spectrum of benchmarks.

Dataset Detail
MUSE stands out with its open-set concepts, detailed object descriptions, complex multi-target question-answer pairs, and instance-level mask annotations.
Earphone: A pair of over-the-ear headphones rests next to the cat.
Bed: A bed covered in a light blue quilt occupies the majority of the scene.
Quilt: A crumpled light blue quilt almost completely covers the bed.
Cat: A grey cat with a collar is lounging on a closed laptop on a bed.
Laptop computer: A closed laptop is positioned towards the foot of the bed under a resting cat.
Question: How can I comfortably listen to music while petting my cat when I get home from a long day at work?
Answer: You can lie down comfortably on the large bed covered with a quilt. You can take the silver laptop out from under the chubby, furry cat, and use the nearby pair of large over-the-ear wired headphones to listen to some music.

Method
[Architecture figure (titled "PixelLM"): labels include CLIP (image encoder), Large Language Model, Text embeddings, Seg. codebook, and a light-weight pixel decoder, with trainable and frozen modules marked; example query "How can I get to the sea far away from the coast?" with the response "Propel it forward using the paddle".]

Visualization
Visualization results in the following scenarios: 1. Multi-target reasoning segmentation; 2. Instance-level segmentation tied with text description; 3. Multi-referring segmentation; 4. Conversation.
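The Dataset Detail example above (per-object descriptions, a multi-target question, and an answer that references several instances) suggests how a MUSE sample might be organized. Below is a minimal, hypothetical sketch of such an entry as a Python structure; the field names (`question`, `answer`, `targets`, `mask_rle`, etc.) are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a MUSE-style multi-target sample (illustrative field
# names only; not the actual MUSE schema). Each answer mentions several target
# instances, and each target carries a description plus an instance-level mask.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TargetInstance:
    category: str      # open-set category name, e.g. "laptop computer"
    description: str   # detailed, instance-specific description
    mask_rle: str      # instance mask, e.g. COCO-style run-length encoding (placeholder)

@dataclass
class MuseSample:
    image_id: str
    question: str      # complex question that may involve multiple targets
    answer: str        # answer text referencing the target instances
    targets: List[TargetInstance] = field(default_factory=list)

# Toy entry built from the poster's example scene.
sample = MuseSample(
    image_id="000123",
    question="How can I comfortably listen to music while petting my cat "
             "when I get home from a long day at work?",
    answer="Lie down on the large bed, take the silver laptop out from under "
           "the cat, and use the nearby over-the-ear headphones.",
    targets=[
        TargetInstance("bed", "A bed covered in a light blue quilt occupies "
                              "the majority of the scene", "<rle>"),
        TargetInstance("cat", "A grey cat with a collar is lounging on a "
                              "closed laptop on a bed", "<rle>"),
        TargetInstance("earphone", "A pair of over-the-ear headphones rests "
                                   "next to the cat", "<rle>"),
    ],
)
print(len(sample.targets), "target instances")
```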
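The Method figure labels above (CLIP encoder, LLM, segmentation codebook, text embeddings, light-weight decoder) outline the data flow. The following is a minimal sketch of that flow under stated assumptions: module names (`PixelLMSketch`, the stand-in CLIP conv stem, the tiny transformer standing in for the LLM) and all tensor shapes are hypothetical, and this is not the authors' implementation.

```python
# Minimal sketch of a PixelLM-style forward pass (hypothetical names/shapes,
# not the authors' code). Idea: learnable segmentation-codebook tokens are
# appended to the LLM input; their output hidden states, combined with image
# features, drive a light-weight decoder that produces one mask per token.
import torch
import torch.nn as nn

class PixelLMSketch(nn.Module):
    def __init__(self, vision_dim=256, llm_dim=512, num_seg_tokens=3, patch=16):
        super().__init__()
        # Stand-in for a frozen CLIP image encoder: patchify into vision features.
        self.clip = nn.Conv2d(3, vision_dim, kernel_size=patch, stride=patch)
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Learnable segmentation codebook tokens appended to the LLM input.
        self.seg_codebook = nn.Parameter(torch.randn(num_seg_tokens, llm_dim))
        # Stand-in for the large language model (a tiny transformer encoder here).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Light-weight decoder head: project seg-token states back to vision space.
        self.mask_proj = nn.Linear(llm_dim, vision_dim)

    def forward(self, image, text_embeds):
        # image: (B, 3, H, W); text_embeds: (B, T, llm_dim)
        feats = self.clip(image)                        # (B, C, h, w)
        B, C, h, w = feats.shape
        img_tokens = feats.flatten(2).transpose(1, 2)   # (B, h*w, C)
        img_in = self.vision_proj(img_tokens)           # (B, h*w, llm_dim)
        seg = self.seg_codebook.unsqueeze(0).expand(B, -1, -1)
        hidden = self.llm(torch.cat([img_in, text_embeds, seg], dim=1))
        seg_states = hidden[:, -seg.shape[1]:]          # hidden states of seg tokens
        # Dot-product between projected seg states and image features -> masks.
        queries = self.mask_proj(seg_states)            # (B, N, C)
        masks = torch.einsum("bnc,bchw->bnhw", queries, feats)
        return masks.sigmoid()                          # (B, N, h, w) low-res masks

# Usage example with dummy inputs.
model = PixelLMSketch()
masks = model(torch.randn(2, 3, 224, 224), torch.randn(2, 10, 512))
print(masks.shape)  # torch.Size([2, 3, 14, 14])
```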
Experimental Results
The upper table displays results on the MUSE benchmark, while the lower table shows results on the referring segmentation benchmark.

WEI Lab @ BJTU   ByteDance
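The tables report segmentation quality on the MUSE and referring segmentation benchmarks; such benchmarks are commonly scored with intersection-over-union metrics. As a rough illustration only (not the paper's evaluation code), here is a sketch of per-mask IoU together with gIoU (mean of per-sample IoUs) and cIoU (cumulative intersection over cumulative union), the metrics typically reported for referring segmentation.

```python
# Rough sketch of IoU-style metrics often reported on referring-segmentation
# benchmarks (illustrative only; not the paper's evaluation code).
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 1.0

def giou_ciou(preds, gts):
    """gIoU: mean of per-sample IoUs. cIoU: total intersection / total union."""
    ious, inter_sum, union_sum = [], 0, 0
    for pred, gt in zip(preds, gts):
        pred, gt = pred.astype(bool), gt.astype(bool)
        ious.append(mask_iou(pred, gt))
        inter_sum += np.logical_and(pred, gt).sum()
        union_sum += np.logical_or(pred, gt).sum()
    giou = float(np.mean(ious)) if ious else 0.0
    ciou = float(inter_sum) / union_sum if union_sum > 0 else 1.0
    return giou, ciou

# Toy usage with random binary masks.
rng = np.random.default_rng(0)
preds = [rng.integers(0, 2, (32, 32)) for _ in range(4)]
gts = [rng.integers(0, 2, (32, 32)) for _ in range(4)]
print(giou_ciou(preds, gts))
```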