The image shows a large presentation panel covered with images and technical references. The panel stands on a black counter, and several people are standing nearby, possibly evaluating the presentation. The images on the panel include a computer screen, a person, and various technical equipment, and the content covers computer vision and machine learning.
Text transcribed from the image:
CVPR
Seattle, WA, June 17-21, 2024
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin
Introduction
• We present PixelLM, a novel LMM for pixel-level reasoning and understanding, capable of handling multiple targets and diverse reasoning complexities without relying on costly segmentation models.
• We construct MUSE, a high-quality multi-target reasoning segmentation dataset. Utilizing a GPT-4V-aided data curation pipeline, it has 246k question-answer pairs, covering 0.9 million instances.
• PixelLM achieves new state-of-the-art results across a spectrum of benchmarks.
Dataset Detail
MUSE stands out with its open-set concepts, detailed
object descriptions, complex multi-target question-
answer pairs, and instance-level mask annotations.
Earphone: A pair of over-the-ear headphones rests next to the cat.
Bed: A bed covered in a light blue quilt occupies the majority of the scene.
Quilt: A crumpled light blue quilt almost completely covers the bed.
Cat: A grey cat with a collar is lounging on a closed laptop.
Laptop computer: A closed laptop is positioned towards the foot of the bed under a resting cat.
Question: How can I comfortably listen to music while petting my cat when I get home from a long day at work?
Answer: You can lie down comfortably on the large bed covered with soft quilts. Then take the silver laptop out from under the chubby furry cat next to you and connect it to the black wired headphones next to you to listen to music.
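One way to picture a MUSE-style sample is as a question, an answer, and a list of described instances, each with its own mask. The sketch below is only an illustration under stated assumptions: the class names, field names, and the toy 2x2 mask are hypothetical and are not taken from the dataset's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str               # open-set concept label, e.g. "cat" (hypothetical field)
    description: str        # detailed object description
    mask: list[list[int]]   # binary mask as rows of 0/1 (toy format, assumed)


@dataclass
class MuseSample:
    question: str
    answer: str
    instances: list[Instance] = field(default_factory=list)

    def num_targets(self) -> int:
        """Number of annotated target instances in this sample."""
        return len(self.instances)


toy_mask = [[0, 1], [1, 1]]  # placeholder; a real mask would be image-sized

sample = MuseSample(
    question=("How can I comfortably listen to music while petting my cat "
              "when I get home from a long day at work?"),
    answer=("You can lie down comfortably on the large bed covered with soft "
            "quilts, take the laptop out from under the cat and connect it "
            "to the headphones."),
    instances=[
        Instance("earphone", "A pair of over-the-ear headphones rests next to the cat", toy_mask),
        Instance("bed", "A bed covered in a light blue quilt", toy_mask),
        Instance("quilt", "A crumpled light blue quilt", toy_mask),
        Instance("cat", "A grey cat with a collar is lounging on a closed laptop", toy_mask),
        Instance("laptop computer", "A closed laptop under a resting cat", toy_mask),
    ],
)

print(sample.num_targets())  # prints 5
```

Keeping the description and mask together per instance mirrors the "instance-level mask annotations" the poster mentions, while the question and answer stay at the sample level.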
Method
[Figure: PixelLM architecture. Components shown: CLIP encoder, text embeddings, seg codebook, large language model, light-weight decoder. Image features are labeled f_img. Legend: trainable, frozen. Example input: "How can I get to the sea far away from the coast?" Example output: "Sit in the kayak, propel it forward using the paddle".]
[Figure: decoder detail. Attention layers, fusion, update of f_img, weighted sum.]
Visualization
Visualization results in following scenarios: 1. Multi-target reasoning segmentation;
2. Instance-level segmentation tied with text description; 3. Multi-referring
segmentation; 4. Conversation
Multi-target reasoning segmentation
USER: If someone wants to windsurf, what tools would they require from the image?
ASSISTANT: A [illegible] that is essential for catching wind and a board that is used for gliding over the water, <SEG> and <SEG>.
USER: Which items would a person use when he slides down the mountain while keeping his hand warm?
ASSISTANT: He uses [illegible] to glide on the snow <SEG> while wearing two [illegible]
USER: How can I comfortably enjoy TV time?
ASSISTANT: You can sit on the spacious sofa <SEG> and use the remote control <SEG> to easily navigate through channels or adjust the volume.
USER: I want to have a refreshing breakfast, how to do?
ASSISTANT: You can drink a cup of coffee <SEG> [illegible] use the knife [illegible] the sandwich <SEG>.
Instance-level segmentation tied with text description
USER: What objects are in the image? Please segment them separately.
ASSISTANT: They are a brush <SEG>, a device near the brush, likely a flashlight, is <SEG>, a slim hairpin is <SEG>, a blue paper tape is <SEG> and a pen [illegible]
Multi-referring segmentation
USER: Segment the bed, mirror and book shelf.
ASSISTANT: The bed, mirror and book shelf are <SEG>, <SEG> and <SEG>, separately.
USER: Segment each of the zebras.
ASSISTANT: The zebra on the far left with only its head exposed is <SEG>, the zebra in the middle is <SEG> and the zebra standing on the right is <SEG>.
Conversation
USER: What is the relationship of these two animals?
ASSISTANT: The relationship between the calf and the cow in the image is that of a mother and her offspring. The cow is nursing her calf, which indicates that they share a strong bond and the cow is providing nourishment to her young one.
USER: Segment them.
ASSISTANT: The calf sucking milk and the cow providing milk to the calf are <SEG> and <SEG>.
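In the responses above, each segmentation placeholder token stands for one predicted mask. A minimal sketch of how such a reply could be split into the text spans that precede each token, assuming the token is written `<SEG>`; the function name is hypothetical, and how PixelLM itself maps tokens to masks is not shown on the poster.

```python
def spans_before_token(response: str, token: str = "<SEG>") -> list[str]:
    """Return the text that precedes each occurrence of `token`, in order."""
    parts = response.split(token)
    # Text after the last token has no mask attached, so it is dropped.
    return [part.strip(" ,.") for part in parts[:-1]]


reply = (
    "You can sit on the spacious sofa <SEG> and use the remote control <SEG> "
    "to easily navigate through channels or adjust the volume"
)
spans = spans_before_token(reply)
print(spans)
# ['You can sit on the spacious sofa', 'and use the remote control']
```

The number of returned spans equals the number of masks the reply refers to, which gives a quick consistency check between the text and the predicted masks.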
Beijing Jiaotong University
WEI Lab@BJTU
ByteDance
University of Science and Technology Beijing
Experimental Results
The above table displays results on the MUSE
benchmark, while the table below shows results
on the referring segmentation benchmark.
PixelLM-7B
PixelLM-7B
Method
Val
Test
Method
TFLOPs
w/o SAM
few targets
many targets
overall
gIoU cIoU
gIoU cIoU
gIoU cIoU
gIoU cIoU
SEEM [43]
✓
0.43
13.6
24.9
85
13.2
11.7
15.7
LISA-7B [14]
x
7.16
18.8
29.0
24.7
36.5
9.6
24.5
12.8
27.1
LISA-7B
x
7.16
24.5
31.1 30.0
30.9
12.4 23.2
16.2 24.8
LISA-7B
x
7.16
42.0
46.1
52.0
37.7 42.3
38.9
3.57
39.9
48.0 43.1
56.7
36.0
37.5 42.2
3.57
42.6
44.6 59.2
37.7
42.8
39.2
LISA-Llama2-13B [14]
10.24
20.4
29.2
LISA-Llama2-13B
38.5
10.9
25.6
14.4 28.4
10.24
43.6
50.2
44.7
60.0
41.2
PixelLM-Llama2-13B
PixelLM-Llama2-13B
47.9 41.9
50.5
6.65
43.0
51.7
44.8
39.3
44.6
40.5
6.65
44.8
54.1
45.2
41.5
47.6 42.3 51.0
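The MUSE results are reported as gIoU and cIoU. A minimal sketch of how these two scores are commonly computed in reasoning segmentation work, assuming gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection over the cumulative union. The poster does not spell out the definitions, so the function names, the flat 0/1 mask format, and the convention for empty masks are illustrative assumptions.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou_ciou(preds, gts):
    """Compute (gIoU, cIoU) over paired lists of flat binary masks."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        # Assumed convention: empty prediction on an empty target scores 1.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Two toy 4-pixel images: the first prediction over-segments by one pixel.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 0, 0, 0]]
giou, ciou = giou_ciou(preds, gts)
print(round(giou, 3), round(ciou, 3))  # 0.75 0.667
```

Because cIoU pools pixels across images, large objects weigh more heavily in it than in gIoU, which is one reason both numbers are usually reported side by side.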
Method          w/o SAM  refCOCO             refCOCO+            refCOCOg
                         val   testA testB   val   testA testB   val(U) test(U)
MCN [22]                 62.4  64.2  59.7    50.6  55.0  44.7    49.2   49.4
VLT [9]                  67.5  70.5  65.2    56.3  61.0  50.1    55.0   57.7
CRIS [33]                70.5  73.2  66.1    62.3  68.1  53.7    59.9   60.4
LAVT [34]                72.7  75.8  68.8    62.1  68.4  55.1    61.2   62.1
ReLA [16]                73.8  76.5  70.2    66.0  71.0  57.7    65.0   66.0
X-Decoder [42]           -     -     -       -     -     -       64.6   -
SEEM [43]                -     -     -       -     -     -       65.7   -
LISA [14]                74.1  76.5  71.1    62.4  67.4  56.5    66.4   68.5
LISA                     74.0  76.3  70.4    62.5  [?]   56.0    67.0   69.1
PixelLM                  73.0  76.5  68.2    66.3  71.7  58.3    69.3   70.5