The image depicts a convention center with a whiteboard and posters advertising a "SAFE FOLD-GEN GAS AYCRP ONE MILE AWAY." People are walking by and some are holding handbags. There are also a few people sitting in chairs, and a dining table can be seen in the background. The overall atmosphere is one of a professional conference.
Text transcribed from the image:
171
ICT
ObjectNav task:
Motivation
Agents need to learn contextual
relationships between objects to
deduce the target's location from
visual clues.
Modular methods
Explicit semantic local map
Lacking contextual relation
learning
Self-supervised generative
map (SGM)
learning joint contextual
relations of the objects in a self-
supervised manner
generating the unobserved
regions of the local map
Global maps
Episodic observations
General knowledge
SGM
Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation
Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, Shuqiang Jiang
Key Lab of Intelligent Information Processing Laboratory of the Chinese Academy of Sciences (CAS),
Institute of Computing Technology, University of Chinese Academy of Sciences
SGM
Self-supervised
training
Reconstruction
SGM helps agents to 'imagine' unobserved regions
Self-supervised generative map
Self-supervised training:
Learning the contextual
relations of both objects and
environments
Reconstructing the masked
patches
Self-supervised setting of MAE
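The masked-patch setup above can be sketched as follows. This is a minimal illustration, assuming uniformly random masking of non-overlapping square patches as in MAE; the function name `mask_patches`, the patch size, and the mask ratio are hypothetical and are not taken from the poster.

```python
import random

def mask_patches(sem_map, patch, mask_ratio, seed=0):
    """Zero out a random subset of patch x patch tiles of a 2D map.

    Returns the masked copy and the top-left corners of the masked tiles.
    Assumption: uniform random masking; the poster only names the MAE setting.
    """
    rng = random.Random(seed)
    h, w = len(sem_map), len(sem_map[0])
    # Top-left corners of all non-overlapping tiles.
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_mask = int(round(mask_ratio * len(corners)))
    masked = rng.sample(corners, n_mask)
    out = [row[:] for row in sem_map]  # copy so the input stays untouched
    for r0, c0 in masked:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                out[r][c] = 0
    return out, masked

if __name__ == "__main__":
    toy_map = [[1] * 4 for _ in range(4)]
    masked_map, masked_tiles = mask_patches(toy_map, patch=2, mask_ratio=0.5)
    print(masked_tiles)
    for row in masked_map:
        print(row)
```

A reconstruction model would then be trained to predict the original values of the masked tiles from the visible ones.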
Available information:
Episodic observations:
geometric details
General knowledge (LLMs):
rich semantic prior
Training objective
pixel-wise binary cross entropy
category-level pixel-wise loss
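As a rough illustration of the pixel-wise binary cross entropy named in the training objective, the sketch below averages the loss over every cell of a single map channel. The function name `pixelwise_bce` and the clipping constant `eps` are assumptions for illustration; the poster does not give the exact formulation.

```python
import math

def pixelwise_bce(pred, target, eps=1e-7):
    """Mean binary cross entropy over all cells of a 2D map.

    pred holds probabilities in [0, 1]; target holds 0/1 labels.
    eps clips predictions so log() never sees exactly 0 or 1.
    """
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

if __name__ == "__main__":
    prediction = [[0.9, 0.1], [0.8, 0.3]]
    label = [[1, 0], [1, 0]]
    print(pixelwise_bce(prediction, label))
```

For a multi-category semantic map, one plausible reading is to apply this per category channel and average the results.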
Visible patches m
Episodic observations
Cross-modality Fusion
General knowledge (LLMs)
Map module:
ObjectNav with SGM
RGB-D observation, target object,
sensor pose
Local semantic map
Unobserved regions generation
SGM generates unobserved regions
based on visible regions
The sampling strategy: 1) informative
and 2) adjacent to unobserved regions.
Informative: $W = \mathrm{AvgPool}(m)$
Near frontier: $V = \mathrm{Conv}(W, K)$
Sampling probability: $P = \alpha W + (1 - \alpha) V$
Sampled patches: drawn from $\mathrm{Multinomial}(n, P)$
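The sampling strategy can be sketched under one reading of the formulas: W = AvgPool(m), V = Conv(W, K), P = αW + (1 − α)V, followed by a multinomial draw of n patches. The kernel K, the value of α, the normalisation of W and V to sum to 1, and all function names below are assumptions made for illustration, not details confirmed by the poster.

```python
import random

def avg_pool(m, patch):
    """Average each non-overlapping patch x patch tile (sizes must divide evenly)."""
    h, w = len(m), len(m[0])
    return [[sum(m[r][c] for r in range(r0, r0 + patch)
                 for c in range(c0, c0 + patch)) / (patch * patch)
             for c0 in range(0, w, patch)]
            for r0 in range(0, h, patch)]

def conv_same(x, k):
    """Zero-padded 2D cross-correlation with an odd-sized kernel, same output size."""
    h, w, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - kh // 2, c + j - kw // 2
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += x[rr][cc] * k[i][j]
    return out

def normalize(x):
    """Scale entries to sum to 1; fall back to uniform if everything is zero."""
    total = sum(sum(row) for row in x)
    n = len(x) * len(x[0])
    return [[(v / total if total > 0 else 1.0 / n) for v in row] for row in x]

def sampling_probability(m, patch, kernel, alpha):
    """P = alpha * W + (1 - alpha) * V over the patch grid (assumed form)."""
    W = normalize(avg_pool(m, patch))
    V = normalize(conv_same(W, kernel))
    return [[alpha * w_ + (1 - alpha) * v_ for w_, v_ in zip(wr, vr)]
            for wr, vr in zip(W, V)]

def sample_patches(P, n, seed=0):
    """Draw n patch indices (row, col) from P; repeats are possible."""
    rng = random.Random(seed)
    cols = len(P[0])
    flat = [p for row in P for p in row]
    picks = rng.choices(range(len(flat)), weights=flat, k=n)
    return [(i // cols, i % cols) for i in picks]

if __name__ == "__main__":
    toy = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # hypothetical kernel K
    P = sampling_probability(toy, patch=2, kernel=ones, alpha=0.5)
    print(P)
    print(sample_patches(P, n=3))
```

Because both terms are normalised before mixing, P sums to 1 for any α in [0, 1], so it can be passed directly to the multinomial draw.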
highest confidence in the target's
channel (long-term goal)
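Choosing the long-term goal as the cell with the highest confidence in the target's channel amounts to an argmax over one channel of the generated map. The sketch below assumes the map is stored as `generated_map[channel][row][col]`; that layout and the function name `select_long_term_goal` are hypothetical.

```python
def select_long_term_goal(generated_map, target_channel):
    """Return (row, col) of the highest-confidence cell in the target's channel.

    Assumption: generated_map is indexed as [channel][row][col].
    Ties are resolved in favour of the first cell in row-major order.
    """
    channel = generated_map[target_channel]
    best_value, best_cell = float("-inf"), None
    for r, row in enumerate(channel):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_cell = value, (r, c)
    return best_cell

if __name__ == "__main__":
    toy_map = [
        [[0.1, 0.2], [0.3, 0.0]],  # channel 0
        [[0.0, 0.9], [0.4, 0.1]],  # channel 1
    ]
    print(select_long_term_goal(toy_map, target_channel=1))
```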
Local policy
CVPR
JUNE 17-21, 2024
SEATTLE, WA
Experimental results
Comparisons in Gibson and MP3D
[Table: ObjectNav comparison on Gibson and MP3D. Recoverable method rows: DD-PPO, Red Rabbit, THDA, SSCNav, ENTL, ANS (Chaplot et al.), SemExp, PONI, L2M, 3D-aware, SGM (Ours). Numeric values are not legible in the transcription.]
Comparisons in HM3D
[Table: ObjectNav comparison on HM3D. Recoverable method row: PEANUT. Numeric values are not legible in the transcription.]
SGM achieves comparable performance with existing supervised methods
Utilizes LLMs to
for traini
Navigation process with generated map
Target object
Pose
Map module
Navigation policy
FMM calculates the shortest path from current location to the long-term goal
Output: action a
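The navigation policy's shortest-path step, from the current location to the long-term goal, can be illustrated on a grid map. The poster names FMM (Fast Marching Method); the sketch below deliberately swaps in Dijkstra's algorithm on a 4-connected grid as a simpler stand-in, so it is not the poster's planner. The grid encoding (truthy means traversable) and the function name `shortest_path` are assumptions.

```python
import heapq

def shortest_path(traversable, start, goal):
    """Dijkstra on a 4-connected grid with unit step cost.

    Stand-in for FMM: returns the list of cells from start to goal,
    or None if the goal cannot be reached.
    """
    h, w = len(traversable), len(traversable[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale queue entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < h and 0 <= nxt[1] < w and traversable[nxt[0]][nxt[1]]:
                nd = d + 1
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = node
                    heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    # Walk predecessors back from the goal, then reverse.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

if __name__ == "__main__":
    grid = [[1, 0, 1],
            [1, 0, 1],
            [1, 1, 1]]
    print(shortest_path(grid, (0, 0), (0, 2)))
```

In an agent loop, the first step of the returned path would determine the next action, and the path would be recomputed as the map and the long-term goal are updated.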
Visualization results
Visualization of navigation episode with SGM
Map reconstruction results
t=20
t=40
Generated Map, Observed Map
Self-supervised training of SGM
General knowledge enables the model to predict
the completely unobserved contextual objects
Panel labels: Generated Map, Observed Map (three pairs)
At about 30% of the navigation process (i.e., t=20), the generated map accurately predicts the target's
location even though the target has not yet been observed.