The image shows a detailed research poster titled "Language-conditioned Detection Transformer," presented at CVPR 2024 (June 17-21, Seattle, Washington). The poster is authored by Jang Hyun Cho and Philipp Krähenbühl of UT Austin. It presents DECOLA (presumably "DEtection COnditioned on LAnguage"), a detection transformer that improves object detection by conditioning on language queries. Diagrams and flowcharts explain language-conditioned query selection and the two-phase training recipe: Phase 1 trains for conditioned prediction, Phase 2 for open-vocabulary detection. Illustrative images featuring a cat near a Coca-Cola bottle demonstrate the stages of the detection pipeline. A text encoder converts language queries ("a cat," "a cola," etc.) into embeddings; the detector scores image features against these embeddings to select its decoder queries, and a masked ("isolated") self-attention keeps each class's query group independent, reducing the attention cost from O(K²Q²) to O(KQ²) for K classes with Q queries each. Bar charts at the bottom compare DECOLA against baseline methods across several open-vocabulary metrics and training-iteration budgets, and a QR code likely links to the full paper or additional resources.

Text transcribed from the image (cleaned; diagram and chart fragments are summarized in brackets):

CVPR, June 17-21, 2024, Seattle, WA
Language-conditioned Detection Transformer
Jang Hyun Cho and Philipp Krähenbühl, UT Austin

DECOLA Overview
- Image-text dataset of size N with image-level tags ("cat", "cola", "mentos")
- Text encoder → DECOLA conditioned on the tags → language-conditioned detections → pseudo-labels → training data with pseudo-labels

Language-conditioned Query Selection
- Present classes "a cat", "a table", "a cola" → Text Encoder → w_cat, w_table, w_cola
- Similarity scores between text embeddings and Image Encoder features: p_cat ∈ R¹, p_table ∈ R¹, p_cola ∈ R¹
- Query Selection → Decoder → single-class classifier per conditioned class
- Naive self-attention: O(K²Q²); isolated self-attention by masking: memory-efficient O(KQ²)

Phase 1 Training: DECOLA for conditioned prediction
- "an object" → Text Encoder → w_object → Similarity Score → Query Selection → q_object → Decoder

Phase 2 Training: DECOLA for open-vocabulary detection
- All N classes ("a cat", "a table", "a cola", ...) → multi-class classifier, p_all ∈ R^N
- Isolated self-attention → efficient self-attention

TL;DR: DECOLA learns language-conditioned queries to optimize the precision of the queried classes.

[Chart fragments: bar charts comparing the baseline, online pseudo-labeling, and DECOLA Phase 1 (offline) on AP unseen, C-AP unseen, C-AR unseen, and C-AP unseen@20; a comparison against OV-DETR; and curves over training iterations (0%, 2%, 10%, 100%) for the 1st stage (proposals) and 2nd stage (predictions).]
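
To make the query-selection panel concrete, here is a minimal PyTorch sketch of choosing decoder queries by text-image similarity, in the spirit of the poster's diagram. The function name `select_queries`, the cosine-similarity scoring, and the `top_k` parameter are illustrative assumptions, not DECOLA's actual interface.

```python
# Minimal sketch of language-conditioned query selection, assuming a
# DETR-style encoder whose output tokens serve as query candidates.
import torch
import torch.nn.functional as F

def select_queries(image_tokens: torch.Tensor,   # (num_tokens, dim) encoder output
                   class_embeds: torch.Tensor,   # (num_classes, dim) text embeddings
                   top_k: int = 100):
    """Pick the top_k encoder tokens per queried class by text-image similarity."""
    # Cosine similarity between every image token and every class embedding.
    sim = F.normalize(image_tokens, dim=-1) @ F.normalize(class_embeds, dim=-1).T
    # For each class, keep the tokens that score highest for that class.
    scores, idx = sim.topk(top_k, dim=0)         # both (top_k, num_classes)
    queries = image_tokens[idx]                  # (top_k, num_classes, dim)
    # Flatten to (num_classes * top_k, dim): each block of top_k queries is
    # conditioned on one class, matching the per-class single-class classifier.
    queries = queries.transpose(0, 1).reshape(-1, image_tokens.size(-1))
    return queries, scores.T.reshape(-1)
```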
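
The poster's contrast between naive O(K²Q²) and isolated O(KQ²) self-attention can be illustrated with a block-diagonal attention mask that stops queries conditioned on different classes from attending to one another. This is a hedged sketch: passing such a mask to PyTorch's stock attention reproduces the isolation semantics, though realizing the memory saving in practice would require batching the K groups rather than materializing the full (K·Q)² mask.

```python
# Sketch of "isolated self-attention by masking": queries from different
# conditioned classes must not attend to one another, so one (K*Q)^2
# attention decomposes into K independent Q^2 blocks.
import torch

def isolated_attn_mask(num_classes: int, queries_per_class: int) -> torch.Tensor:
    """Boolean mask (True = blocked), block-diagonal over per-class groups."""
    group = torch.arange(num_classes).repeat_interleave(queries_per_class)
    return group[:, None] != group[None, :]      # (K*Q, K*Q)

# Usage with PyTorch's built-in attention; sizes here are illustrative.
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
q = torch.randn(1, 3 * 100, 256)                 # K=3 classes, Q=100 queries each
mask = isolated_attn_mask(3, 100)
out, _ = attn(q, q, q, attn_mask=mask)           # each class group attends only to itself
```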
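
The overview panel's pseudo-labeling loop can be sketched as follows: condition the Phase-1 model on each image's weak tags and keep its most confident box per tag. The `detector` object and its `prompt` keyword are hypothetical stand-ins for whatever interface the released code exposes.

```python
# Sketch of offline pseudo-labeling on weakly labeled image-text data.
# `detector(image, prompt=...)` is an assumed interface returning a tensor of
# boxes and a tensor of confidence scores for the queried prompt.
def pseudo_label(detector, image, tags):
    labels = []
    for tag in tags:                         # e.g. ["cat", "cola", "mentos"]
        boxes, scores = detector(image, prompt=f"a {tag}")
        best = scores.argmax()               # highest-precision detection wins
        labels.append((tag, boxes[best]))    # becomes a pseudo box label
    return labels
```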