"At an academic conference, a detailed research poster titled 'UniMODE: Unified Monocular 3D Object Detection' is displayed prominently. The poster, presented by authors Zhuoling Li, Xiaogang Xu, Ser-Nam Lim, and Hengshuang Zhao, details a novel approach to 3D object detection using monocular vision. Key sections include a visualization section showing sample outputs across datasets such as nuScenes, ARKitScenes, and Hypersim. Another significant portion of the poster is devoted to the framework of the approach, including the neural network layers and loss functions used. There is also a comparison of their method with existing detectors, showing competitive performance metrics. Additionally, the poster highlights the challenges of geometry distribution gaps and heterogeneous domain distributions. Attendees are seen engaging with the poster, potentially asking the presenters questions about their work." Text transcribed from the image:

CVPR, June 17-21, 2024, Seattle, WA

UniMODE: Unified Monocular 3D Object Detection
Zhuoling Li, Xiaogang Xu, Ser-Nam Lim, Hengshuang Zhao
¹HKU ²CUHK ³ZJU ⁴UCF

Motivation
Although numerous 3D object detectors have been developed, they are mostly designed for a single domain.

Framework
[Architecture diagram: input images (B×3×H×W) pass through a feature head and a depth head; the extracted features are projected as a sparse BEV feature onto an uneven BEV feature grid and processed by a BEV encoder with self-attention; a proposal head produces a proposal map and M proposal queries, which are concatenated with N random queries into M+N queries via an MLP; a BEV decoder (×6) applies DALN, cross-attention, and an FFN; a domain head predicts domain confidences (C1, C2, ..., Cn) that drive input-dependent domain parameters; a class alignment loss supervises the detection results. Insets contrast sparse tokens on the uneven grid (ours) with the even grid used previously.]

Visualization
[Sample detection results on nuScenes, ARKitScenes, and Hypersim.]
Unifying indoor and outdoor detection is challenging due to diverse geometry distributions and heterogeneous domain distributions.

[Diagram: images from diverse domains; dense feature vs. sparse feature projection on an uneven grid (ours) vs. an even grid (previous); DALN block: input feature → layer norm → feature mini-adjust → output feature; training curves showing unstable training and gradient NaN for baseline detectors.]

In this work, we propose the UniMODE detector, which achieves SOTA performance in unified 3D object detection.

[Figures: Geometry Distribution Gap; Heterogeneous Domain Distributions, e.g. objects left unlabeled across datasets, with examples from KITTI, Objectron, and SUN-RGBD.]

Main Results

Comparison with Existing Detectors
[Table comparing M3D-RPN, SMOKE, FCOS3D, PGD, GUPNet, ImVoxelNet, BEVFormer, PETR, Cube R-CNN, UniMODE, and UniMODE* on AP3D-style metrics (overall, near/medium/far, and per-dataset scores for ARKitScenes, Objectron, KITTI, nuScenes, and SUN-RGBD) with DLA34 and ConvNeXt backbones; individual cell values are scrambled in the transcription, but the UniMODE rows report the highest overall scores, e.g. 39.1% and 41.0% versus 31.9% for Cube R-CNN.]

Although there are many popular 3D object detectors, we find that they cannot converge smoothly in unified 3D object detection. By contrast, by incorporating our proposed techniques, the developed detector UniMODE achieves stable training.

Two-Stage Detection Architecture: utilizing the first-stage network to inform the second stage about the rough target distribution, the training is stabilized.

Uneven BEV Grid: to address the grid-size conflict between indoor and outdoor scenes, we propose the uneven BEV grid.

Domain Adaptive Layer Normalization: an efficient adaptive feature normalization method is developed to bridge the significant feature discrepancy between various data domains.

Class Alignment Loss: we devise an effective loss to address the label conflict between various datasets.

[Qualitative results: (a) ARKitScenes, (b) Hypersim; indoor and outdoor detections.]
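The poster does not give Domain Adaptive Layer Normalization in equation form; a minimal NumPy sketch of one plausible reading, where a standard layer norm is followed by a small "mini-adjust" whose scale and shift are predicted from the domain-confidence vector, might look like this (the function name, shapes, and the residual-style formulation are all assumptions, not the paper's actual definition):

```python
import numpy as np

def daln(x: np.ndarray, domain_conf: np.ndarray,
         W: np.ndarray, b: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Hypothetical sketch of Domain Adaptive Layer Normalization (DALN):
    an ordinary layer norm followed by an input-dependent adjustment whose
    per-sample scale/shift are predicted from the domain-confidence vector.
    Shapes: x (B, N, C) tokens, domain_conf (B, D), W (D, 2C), b (2C,)."""
    # 1) Standard layer normalization over the channel axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    h = (x - mu) / np.sqrt(var + eps)
    # 2) 'Mini-adjust': predict per-sample gamma/beta from domain confidence.
    gamma, beta = np.split(domain_conf @ W + b, 2, axis=-1)  # each (B, C)
    # Residual-style modulation: zero gamma/beta recovers plain layer norm.
    return h * (1.0 + gamma[:, None, :]) + beta[:, None, :]
```

With `W` and `b` at zero the adjustment vanishes, so the module degrades gracefully to a shared layer norm when domain confidence carries no signal.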
Ablation Study
[Rows toggle the components PH, UBG, SBFP, and UDA; the per-row checkmarks are illegible in the transcription. Metrics per row are AP↑ / AP↑ / AP3D↑ / Improvement:]
10.9% / 14.3% / 12.3% / –
13.4% / 22.2% / 15.9% / +3.6%
14.0% / 23.8% / 16.6% / +0.7%
13.4% / 23.7% / 16.6% / +0.0%
14.8% / 24.5% / 17.4% / +0.8%
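The uneven BEV grid proposed on the poster uses fine cells near the camera (where indoor objects are small and close) and coarse cells far away (to cover outdoor range). The sketch below illustrates the idea with a geometric growth rule along one axis; the growth factor and function name are assumptions for illustration, not the paper's actual spacing:

```python
import numpy as np

def uneven_bev_edges(num_cells: int, max_range: float,
                     growth: float = 1.05) -> np.ndarray:
    """Hypothetical sketch of an uneven BEV grid along one axis: cell
    widths grow geometrically with distance, so resolution is fine near
    the camera (indoor scenes) and coarse far away (outdoor scenes)."""
    widths = growth ** np.arange(num_cells)   # geometrically growing cell widths
    widths *= max_range / widths.sum()        # rescale so cells cover max_range
    return np.concatenate([[0.0], np.cumsum(widths)])
```

For example, `uneven_bev_edges(10, 50.0)` makes the nearest cell narrower than the farthest one, whereas a conventional even grid would use a constant 5 m width everywhere.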