This image shows a research poster titled "Understanding with Open Vocabularies," authored by R. Cottereau and Wei Tsang Ooi and presented at CVPR (Conference on Computer Vision and Pattern Recognition) 2024, held in Seattle, WA from June 17-21, 2024. The section displayed presents results and analysis for "OpenESS," including a comparative and ablation study, an event representation learning table, and cross-domain ESS pretraining graphs. Key points include that OpenESS achieved state-of-the-art (SoTA) results under zero-shot, fully-supervised, and open-vocabulary setups, surpassing other event representation learning methods. Factors such as better adaptation of image and text knowledge, distillation strength, and label efficiency are discussed, emphasizing the potential of OpenESS to improve the scalability and accuracy of event-based semantic segmentation in real-world applications. The poster includes detailed graphs and tables illustrating the comparative results, with sections on methodology and key findings marked by bullet points and color-coded data. Attendees are shown engaging with the poster, reflecting the interactive nature of the conference presentation.

Text transcribed from the image:

Header (partially cropped): "...standing with Open Vocabularies", R. Cottereau, Wei Tsang Ooi, CVPR, June 17-21, 2024, Seattle, WA.

Fragments from the cropped left column: "...with the superpixel-... regularization loss." / "...Contrastive Prompt..."

Experiments & Analysis

Comparative & Ablation Study: OpenESS achieved SoTA results under zero-shot, fully-supervised, and open-vocabulary ESS setups.

Tab. Comparisons to state-of-the-art ESS methods, reporting Acc and mIoU on DDD17 and DSEC under three settings. Annotation-Free ESS: MaskCLIP [100] (ECCV'22), FC-CLIP [97] (NeurIPS'23), OpenESS (Ours). Fully-Supervised ESS: Ev-SegNet [2] (CVPRW'19), E2VID [73] (TPAMI'19), Vid2E [30] (CVPR'20), EVDistill [84] (CVPR'21), DTL [83] (ICCV'21), PVT-FPN [86] (ICCV'21), SpikingFCN [49] (NCE'22), EV-Transfer [61] (RA-L'22), ESS [79] (ECCV'22), ESS-Sup [79] (ECCV'22), P2T-FPN [91] (TPAMI'23), EvSegformer [47] (TIP'23), HMNet-B [38] (CVPR'23), HMNet-L [38] (CVPR'23), HALSIE [6] (WACV'24). Open-Vocabulary ESS: MaskCLIP [100], DDD17 90.50 / 61.27, DSEC 89.81 / 55.01; FC-CLIP [97], DDD17 90.68 / 62.01, DSEC 89.97 / 55.67; OpenESS (Ours), DDD17 91.05 / 63.00, DSEC 90.21 / 57.21. [The remaining cell values are interleaved in the scan and cannot be reliably attributed to individual rows.]

Qualitative example labels: "road", "sidewalk", "building".

Tab. Event Representation Learning, reporting mIoU on DDD17 and DSEC for different pretraining methods, backbones, and open-vocabulary (OV) capability: Random, MoCoV3 [16] (ICCV'21), iBOT [101] (ICLR'22), and ECDP [95] (ICCV'23) on ViT-S/16; Random, BEiT [3] (ICLR'22), and MAE [40] (CVPR'22) on ViT-B/16; Random, SimCLR [14] (ICML'20), and ECDP [95] (ICCV'23) on ResNet-50; and OpenESS (Ours) on ResNet-50 and E2VID, reaching 63.00 (DDD17) and 57.21 (DSEC) mIoU in its open-vocabulary configuration. [Most individual cell values are scrambled in the scan and are not reproduced.]

Qualitative example labels (open vocabulary): "driveable", "walkable", "manmade".

Fig. Cross-Domain ESS Pretraining.

OpenESS exhibits better results than other event representation learning methods in the literature.

We unveil important factors: adapting better image and text knowledge to the event network, the strength of distillation, linear probing, and label efficiency.
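To make the zero-shot and open-vocabulary setting concrete, here is a minimal sketch, not the authors' code, of how per-pixel event features are typically scored against frozen CLIP text embeddings of class prompts such as the "road", "sidewalk", and "building" labels shown on the poster. All tensor names, shapes, and the random placeholder values below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class_prompts = ["road", "sidewalk", "building"]  # prompts shown on the poster

# Hypothetical shapes: C-dimensional features on an H x W grid,
# one C-dimensional embedding per prompt.
C, H, W = 512, 60, 80
event_features = torch.randn(C, H, W)                 # stand-in for the event network output
text_embeddings = torch.randn(len(class_prompts), C)  # stand-in for frozen CLIP text embeddings

# Cosine similarity between every pixel feature and every class prompt.
pixel_feats = F.normalize(event_features.flatten(1).t(), dim=-1)  # (H*W, C)
prompt_feats = F.normalize(text_embeddings, dim=-1)               # (K, C)
logits = pixel_feats @ prompt_feats.t()                           # (H*W, K)

# Zero-shot prediction: assign each pixel the most similar prompt.
prediction = logits.argmax(dim=-1).view(H, W)
print(prediction.shape)  # torch.Size([60, 80])
```

In a setup like this, an open-vocabulary query simply swaps the prompt list, for example to "driveable", "walkable", "manmade", without retraining the event network.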
[Figure residue: an "Ablation on distillation strength" panel and label-efficiency curves (mIoU (%), in-distribution (ID) and out-of-distribution (OOD), on DSEC-Semantic and DDD17-Seg at 1%, 5%, 10%, and 20% of the labels, against a Random baseline), plus pretraining comparisons involving Random, MoCoV2, SwAV, and GT with SAM- and SLIC-based superpixels (150 and 200 segments); the individual data points are not recoverable from the transcription.]

OpenESS sheds light on the future development of more scalable ESS systems in the real world. By incorporating image-text knowledge, we anticipate event perception models that are robust and accurate enough to ensure safety.
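The distillation-strength ablation suggests a single weight that trades a supervised segmentation loss off against a feature-distillation loss pulling event features toward frozen CLIP image/text features. Below is a hypothetical sketch of such a weighted objective; the function name total_loss, the cosine-distance distillation term, and the default lambda_distill value are assumptions, not details taken from the poster.

```python
import torch
import torch.nn.functional as F

def total_loss(seg_logits, labels, event_feats, clip_feats, lambda_distill=0.5):
    """Hypothetical combined objective: supervision plus CLIP-feature distillation."""
    # Standard cross-entropy on the semantic labels (255 marks unlabeled pixels).
    seg_loss = F.cross_entropy(seg_logits, labels, ignore_index=255)
    # Cosine-distance distillation pulling event features toward frozen CLIP features.
    distill_loss = 1.0 - F.cosine_similarity(event_feats, clip_feats, dim=1).mean()
    # lambda_distill plays the role of the "distillation strength" being ablated.
    return seg_loss + lambda_distill * distill_loss

# Example shapes (all placeholders): batch of 2, 3 classes, 512-dim features, 60x80 grid.
seg_logits = torch.randn(2, 3, 60, 80)
labels = torch.randint(0, 3, (2, 60, 80))
event_feats = torch.randn(2, 512, 60, 80)
clip_feats = torch.randn(2, 512, 60, 80)
print(total_loss(seg_logits, labels, event_feats, clip_feats).item())
```

Sweeping lambda_distill over a range of values is one way a distillation-strength ablation like the one on the poster could be produced.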