The image depicts a convention center with a whiteboard and posters, including a research poster whose text is transcribed below. People are walking by, some carrying handbags. A few people are sitting in chairs, and a dining table can be seen in the background. The overall atmosphere is that of a professional conference.

Text transcribed from the image:

Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation
Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, Shuqiang Jiang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS), Institute of Computing Technology; University of Chinese Academy of Sciences

Motivation
ObjectNav task: Agents need to learn contextual relationships between objects to deduce the target's location from visual clues.
Modular methods: Explicit semantic local map. Lacking contextual relation learning.
Self-supervised generative map (SGM): learning joint contextual relations of the objects in a self-supervised manner; generating the unobserved regions of the local map.
Figure labels: Global maps, Episodic observations, General knowledge, SGM, Self-supervised training, Reconstruction.
SGM helps agents to 'imagine' unobserved regions.

Self-supervised generative map
Self-supervised training: learning the contextual relations of both objects and environments; reconstructing the masked patches; self-supervised setting of MAE.
Available information: Episodic observations: geometric details. General knowledge (LLMs): rich semantic prior.
Training objective: pixel-wise binary cross entropy; category-level pixel-wise [illegible].
Figure labels: Visible patches, Episodic observations, Cross-modality fusion, General knowledge (LLMs).

ObjectNav with SGM
Map module: RGB-D observation, target object, sensor pose; local semantic map.
Unobserved regions generation: SGM generates unobserved regions based on visible regions.
The sampling strategy: sampled patches should be 1) informative and 2) adjacent to unobserved regions.
$W = \mathrm{AvgPool}(m)$ (informative)
$V = \mathrm{Conv}(W, K)$ (near frontier)
$P = \alpha W + (1 - \alpha) V$ (sampling probability)
sampled patches $\sim \mathrm{Multinomial}(n, P)$

Long-term goal: the location with the highest confidence in the target's channel.
Navigation policy (local policy): FMM calculates the shortest path from current location to the long-term goal.
Figure labels: Target object, Pose, Map module, Navigation policy, Navigation process with generated map. Output: action a.

Experimental results
Comparisons in Gibson and MP3D; comparisons in HM3D. [The numeric table entries are not recoverable from the transcription.] Legible method names include DD-PPO, Red Rabbit, THDA, ENTL, ANS, PONI, 3D-aware, PEANUT, and SGM (Ours).
A partially legible table note reads: "Utilizes LLMs to [illegible] for training [truncated]".
SGM achieves comparable performance with existing supervised methods.

Visualization results
Visualization of navigation episode with SGM: panels pair a Generated Map with an Observed Map at t=20 and t=40.
At about 30% of the navigation process (i.e., t=20), the generated map accurately predicts the target's location even though the target has not yet been observed.
Map reconstruction results: Self-supervised training of SGM. General knowledge enables the model to predict the completely unobserved contextual objects.

CVPR, June 17-21, 2024, Seattle, WA
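
The patch sampling rule transcribed above (an informativeness score W from average pooling, a frontier score V from a convolution of W with a kernel K, a mixed probability P, and a Multinomial draw of n patches) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`avg_pool`, `conv_same`, `sample_patches`), the patch size, the all-ones 3x3 kernel, the value of alpha, averaging over channels, and the final normalization of P are all assumptions made for illustration. The draw uses weighted sampling without replacement as a stand-in for the Multinomial step on the poster.

```python
import numpy as np


def avg_pool(m, patch):
    """Average a (C, H, W) map over channels and non-overlapping patch x patch cells."""
    c, h, w = m.shape
    gh, gw = h // patch, w // patch
    cells = m[:, :gh * patch, :gw * patch].reshape(c, gh, patch, gw, patch)
    return cells.mean(axis=(0, 2, 4))  # shape (gh, gw)


def conv_same(w, k):
    """Zero-padded 'same' 2D correlation of grid w with an odd-sized kernel k."""
    kh, kw = k.shape
    padded = np.pad(w, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(w, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * padded[i:i + w.shape[0], j:j + w.shape[1]]
    return out


def sample_patches(m, n, patch=4, alpha=0.5, kernel=None, rng=None):
    """Return (coords, P): sampled patch grid coordinates and the probability grid.

    All default values are hypothetical; the poster does not state them.
    """
    rng = np.random.default_rng() if rng is None else rng
    kernel = np.ones((3, 3)) if kernel is None else kernel

    W = avg_pool(m, patch)           # "informative" score per patch
    V = conv_same(W, kernel)         # "near frontier" score per patch
    P = alpha * W + (1 - alpha) * V  # mixed sampling score

    flat = P.ravel()
    total = flat.sum()
    if total <= 0:
        raise ValueError("map is empty; nothing to sample")
    flat = flat / total              # normalize to a probability vector (assumption)

    # Cannot draw more distinct patches than there are patches with non-zero mass.
    n = min(n, int(np.count_nonzero(flat)))
    idx = rng.choice(flat.size, size=n, replace=False, p=flat)
    coords = np.stack(np.unravel_index(idx, P.shape), axis=1)  # rows of (row, col)
    return coords, flat.reshape(P.shape)


if __name__ == "__main__":
    # Toy 2-channel 16x16 map with only the top-left quadrant observed.
    toy = np.zeros((2, 16, 16))
    toy[0, :8, :8] = 1.0
    coords, P = sample_patches(toy, n=5, patch=4, rng=np.random.default_rng(0))
    print(coords)
    print(P.round(3))
```

In this toy setup, patches far from any observed content receive zero probability, while patches in or next to the observed quadrant share the probability mass, with alpha controlling the balance between the two scores.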