The image shows a detailed research poster titled "Language-conditioned Detection Transformer," presented at CVPR 2024 (June 17-21, Seattle, Washington). The poster is authored by Jang Hyun Cho and Philipp Krähenbühl of UT Austin. It presents DECOLA (presumably "DEtection COnditioned on LAnguage"), a detection transformer that improves object detection by conditioning on language queries. Diagrams and flowcharts explain language-conditioned query selection and the two-phase training recipe: Phase 1 trains for conditioned prediction, Phase 2 for open-vocabulary detection. Illustrative images featuring a cat near a Coca-Cola bottle demonstrate the stages of the detection pipeline. A text encoder converts language queries ("a cat," "a cola," etc.) into embeddings; the detector scores image features against these embeddings to select its decoder queries, and a masked ("isolated") self-attention keeps each class's query group independent, reducing the attention cost from O(K²Q²) to O(KQ²) for K classes with Q queries each. Bar charts at the bottom compare DECOLA against baseline methods across several open-vocabulary metrics and training-iteration budgets, and a QR code likely links to the full paper or additional resources.

Text transcribed from the image (cleaned; diagram and chart fragments are summarized in brackets):

CVPR, June 17-21, 2024, Seattle, WA
Language-conditioned Detection Transformer
Jang Hyun Cho and Philipp Krähenbühl, UT Austin

DECOLA Overview
- Image-text dataset of size N with image-level tags ("cat", "cola", "mentos")
- Text encoder → DECOLA conditioned on the tags → language-conditioned detections → pseudo-labels → training data with pseudo-labels

Language-conditioned Query Selection
- Present classes "a cat", "a table", "a cola" → Text Encoder → w_cat, w_table, w_cola
- Similarity scores between text embeddings and Image Encoder features: p_cat ∈ R¹, p_table ∈ R¹, p_cola ∈ R¹
- Query Selection → Decoder → single-class classifier per conditioned class
- Naive self-attention: O(K²Q²); isolated self-attention by masking: memory-efficient O(KQ²)

Phase 1 Training: DECOLA for conditioned prediction
- "an object" → Text Encoder → w_object → Similarity Score → Query Selection → q_object → Decoder

Phase 2 Training: DECOLA for open-vocabulary detection
- All N classes ("a cat", "a table", "a cola", ...) → multi-class classifier, p_all ∈ R^N
- Isolated self-attention → efficient self-attention

TL;DR: DECOLA learns language-conditioned queries to optimize the precision of the queried classes.

[Chart fragments: bar charts comparing the baseline, online pseudo-labeling, and DECOLA Phase 1 (offline) on AP unseen, C-AP unseen, C-AR unseen, and C-AP unseen@20; a comparison against OV-DETR; and curves over training iterations (0%, 2%, 10%, 100%) for the 1st stage (proposals) and 2nd stage (predictions).]
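
To make the query-selection panel concrete, here is a minimal PyTorch sketch of choosing decoder queries by text-image similarity, in the spirit of the poster's diagram. The function name `select_queries`, the cosine-similarity scoring, and the `top_k` parameter are illustrative assumptions, not DECOLA's actual interface.

```python
# Minimal sketch of language-conditioned query selection, assuming a
# DETR-style encoder whose output tokens serve as query candidates.
import torch
import torch.nn.functional as F

def select_queries(image_tokens: torch.Tensor,   # (num_tokens, dim) encoder output
                   class_embeds: torch.Tensor,   # (num_classes, dim) text embeddings
                   top_k: int = 100):
    """Pick the top_k encoder tokens per queried class by text-image similarity."""
    # Cosine similarity between every image token and every class embedding.
    sim = F.normalize(image_tokens, dim=-1) @ F.normalize(class_embeds, dim=-1).T
    # For each class, keep the tokens that score highest for that class.
    scores, idx = sim.topk(top_k, dim=0)         # both (top_k, num_classes)
    queries = image_tokens[idx]                  # (top_k, num_classes, dim)
    # Flatten to (num_classes * top_k, dim): each block of top_k queries is
    # conditioned on one class, matching the per-class single-class classifier.
    queries = queries.transpose(0, 1).reshape(-1, image_tokens.size(-1))
    return queries, scores.T.reshape(-1)
```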
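
The poster's contrast between naive O(K²Q²) and isolated O(KQ²) self-attention can be illustrated with a block-diagonal attention mask that stops queries conditioned on different classes from attending to one another. This is a hedged sketch: passing such a mask to PyTorch's stock attention reproduces the isolation semantics, though realizing the memory saving in practice would require batching the K groups rather than materializing the full (K·Q)² mask.

```python
# Sketch of "isolated self-attention by masking": queries from different
# conditioned classes must not attend to one another, so one (K*Q)^2
# attention decomposes into K independent Q^2 blocks.
import torch

def isolated_attn_mask(num_classes: int, queries_per_class: int) -> torch.Tensor:
    """Boolean mask (True = blocked), block-diagonal over per-class groups."""
    group = torch.arange(num_classes).repeat_interleave(queries_per_class)
    return group[:, None] != group[None, :]      # (K*Q, K*Q)

# Usage with PyTorch's built-in attention; sizes here are illustrative.
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
q = torch.randn(1, 3 * 100, 256)                 # K=3 classes, Q=100 queries each
mask = isolated_attn_mask(3, 100)
out, _ = attn(q, q, q, attn_mask=mask)           # each class group attends only to itself
```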
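
The overview panel's pseudo-labeling loop can be sketched as follows: condition the Phase-1 model on each image's weak tags and keep its most confident box per tag. The `detector` object and its `prompt` keyword are hypothetical stand-ins for whatever interface the released code exposes.

```python
# Sketch of offline pseudo-labeling on weakly labeled image-text data.
# `detector(image, prompt=...)` is an assumed interface returning a tensor of
# boxes and a tensor of confidence scores for the queried prompt.
def pseudo_label(detector, image, tags):
    labels = []
    for tag in tags:                         # e.g. ["cat", "cola", "mentos"]
        boxes, scores = detector(image, prompt=f"a {tag}")
        best = scores.argmax()               # highest-precision detection wins
        labels.append((tag, boxes[best]))    # becomes a pseudo box label
    return labels
```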