A person is pointing at a large display board covered with posters, charts, and graphs, likely giving a presentation or sharing information with others.
Text transcribed from the image:
Gr
Carnegie Mellon University
IMW
Task
GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira,
Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi
Dataset
Results
Brick fireplace with screen tv above. The
fireplace is located below
the speaker
Double oven located near
the kitchen counter,
kitchen cabinet, sink, and
blinds
Go to Anything (GOAT)
An embodied agent is tasked with navigating to a sequence of
open-vocabulary goals, each specified via a category name,
a language description, or an image.
Denim jacket located
behind the purse
Leverage simulator metadata and large language
models (LLM) to generate language goals
Habitat Simulator
Grey couch located on the
left side of the room next
to the picture and the
bookshelf that is located
above the box and next to
the container it is red and
has a lot of books on it
SELECTING BEST VIEWPOINT
BBOX INFO
Object goal, Language goal, Image goal
white towel located near
the sink in the bathroom
cabinet and mirror region
Radiator in the room
where there is a stack of
jackets hanging on it
Examples of GOAT-Bench goals
Key Features
Open-Vocabulary Multi-modal Goals: Specified as an
image, a language description, or an object category. Tests
generalization to seen and unseen objects in unseen
scenes.
Lifelong Navigation: In each episode, the agent is
tasked with navigating to 5-10 goals in the same scene.
Reproducible: Comprehensive benchmarking of
existing methods by leveraging simulation.
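The task and key features above can be sketched as a small data structure. This is only an illustration, not the benchmark's actual episode format: the names `GoalType`, `Goal`, and `Episode`, and all field names, are hypothetical. The only facts taken from the poster are the three goal modalities and the 5-10 goals per episode in a single scene.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class GoalType(Enum):
    """The three goal modalities named on the poster."""
    OBJECT = "object"      # category name, e.g. "bed"
    LANGUAGE = "language"  # free-form language description
    IMAGE = "image"        # image of the target (stored here as a path)


@dataclass
class Goal:
    # Hypothetical fields: exactly one payload is expected per goal type.
    goal_type: GoalType
    category: Optional[str] = None
    description: Optional[str] = None
    image_path: Optional[str] = None

    def is_valid(self) -> bool:
        """A goal must carry the payload that matches its modality."""
        payload = {
            GoalType.OBJECT: self.category,
            GoalType.LANGUAGE: self.description,
            GoalType.IMAGE: self.image_path,
        }[self.goal_type]
        return bool(payload)


@dataclass
class Episode:
    # One scene, several goals visited in sequence (lifelong navigation).
    scene_id: str
    goals: List[Goal]

    def is_valid(self, min_goals: int = 5, max_goals: int = 10) -> bool:
        """Check the 5-10 goals-per-episode range stated on the poster."""
        in_range = min_goals <= len(self.goals) <= max_goals
        return in_range and all(goal.is_valid() for goal in self.goals)


# Example episode mixing all three modalities (values are made up).
example_episode = Episode(
    scene_id="example_scene",
    goals=[
        Goal(GoalType.OBJECT, category="bed"),
        Goal(GoalType.LANGUAGE, description="white towel located near the sink"),
        Goal(GoalType.IMAGE, image_path="goal_images/oven.png"),
        Goal(GoalType.OBJECT, category="couch"),
        Goal(GoalType.LANGUAGE, description="double oven near the kitchen counter"),
    ],
)
```

Keeping the modality as an explicit field makes it easy for a downstream agent to route each goal to the right encoder or policy.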
(Diagram labels: RGB, CroCo, BERT, CLIP, OBJ, GOAL)
Maps sensors to
actions using a
separate end-to-end
trained
CNN+RNN policy
for each modality
SenseAct-NN Skill-chain
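The skill-chain idea, one separately trained policy per goal modality, amounts to dispatching on the goal type. The sketch below is a hypothetical illustration with stub policies standing in for the trained CNN+RNN networks. The names `make_skill_chain`, `Observation`, and `Policy`, and the action strings, are assumptions and do not come from the poster.

```python
from typing import Callable, Dict

# Hypothetical types: an observation is a dict of sensor readings plus the
# current goal payload; a policy maps an observation to an action name.
Observation = Dict[str, object]
Policy = Callable[[Observation], str]


def make_skill_chain(policies: Dict[str, Policy]) -> Callable[[str, Observation], str]:
    """Return an agent that routes each goal to the policy for its modality."""

    def act(goal_type: str, obs: Observation) -> str:
        if goal_type not in policies:
            raise KeyError(f"no policy registered for goal type {goal_type!r}")
        return policies[goal_type](obs)

    return act


# Stub policies standing in for separately trained networks.
def object_policy(obs: Observation) -> str:
    return "move_forward"


def language_policy(obs: Observation) -> str:
    return "turn_left"


def image_policy(obs: Observation) -> str:
    return "turn_right"


skill_chain = make_skill_chain(
    {"object": object_policy, "language": language_policy, "image": image_policy}
)
```

A monolithic agent would instead be a single policy that receives the goal encoding for any modality, so the dispatch table disappears.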
"Describe the bed"
BLIP-2
"a large bed with a
floral comforter"
+ prompt
countertop located on a
cabinet
"Find the bed with a floral comforter and a pillow in the middle."
Language goal generation pipeline
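The pipeline above (a captioning model describes the object, then the caption and simulator metadata go into an LLM prompt) can be sketched as a prompt-building step. This is a minimal sketch under stated assumptions: the function `build_goal_prompt`, its parameters, and the prompt wording are hypothetical, and the actual prompt used by the authors is not shown on the poster. Calls to the captioning model and the LLM are deliberately left out.

```python
from typing import List


def build_goal_prompt(category: str, caption: str, nearby_objects: List[str]) -> str:
    """Assemble an LLM prompt from a caption and simulator metadata.

    All inputs are assumed to be available already: `caption` from a
    captioning model, `nearby_objects` from simulator metadata.
    """
    nearby = ", ".join(nearby_objects) if nearby_objects else "none"
    return (
        f"Target object category: {category}\n"
        f"Appearance caption: {caption}\n"
        f"Nearby objects: {nearby}\n"
        "Write one instruction that starts with 'Find the' and "
        "uniquely describes the target object."
    )


# Example using the caption shown on the poster; the nearby object is made up.
prompt = build_goal_prompt(
    category="bed",
    caption="a large bed with a floral comforter",
    nearby_objects=["pillow"],
)
```

The resulting string would then be sent to an LLM, which on the poster returns a goal such as "Find the bed with a floral comforter and a pillow in the middle."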
Baselines
(Diagram labels: RGB, LANG, CLIP, OBJECT DETECTION, GOAL)
Maps sensors to actions using a single end-to-end trained CNN+RNN policy
(Diagram labels: 3D PROJECTION, TOPDOWN SEMANTIC MAP, ACTION)
SenseAct-NN Monolithic
(Diagram labels: OBJ, LANG, CLIP FEATURE, KEYPOINT MATCHING, ACTION, INSTANCE MAP)
(Results bar charts; legible annotations: +6.6%, +4.0%, 32.3; axis label: SPL)
CVPR
JUNE 17-21, 2024
(Bar chart methods: Modular GOAT, Modular, SenseAct-NN; legible values: 10.5, 15.9, 13.1, 102)
Skill chaining achieves SOTA
on success rate
Modular GOAT achieves SOTA
on SPL
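The two headline metrics can be computed as follows, assuming the poster uses the standard definition of SPL (Success weighted by Path Length): each successful episode contributes the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, and failures contribute zero. The function names and argument layout are illustrative only.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes (or sub-tasks) that ended in success."""
    if not successes:
        return 0.0
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        return 0.0
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # max() keeps the ratio at or below 1 even if the logged path
            # is slightly shorter than the reference shortest path.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# One success with a path twice the optimum, one failure.
example_spl = spl([True, False], [5.0, 5.0], [10.0, 7.0])  # 0.25
```

A method can therefore lead on success rate while trailing on SPL if it reaches goals reliably but along longer paths, which matches the split result reported above.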
Efficiency of SenseAct-NN and modular methods improves over time
(Line plots; legend: Modular GOAT, SenseAct-NN Monolithic)
Efficiency of
navigation improves
for both modular and
end-to-end trained
methods
(Plots; x-axis: Number of sub-tasks; legend: With memory, Without memory; methods: Modular GOAT, SenseAct-NN Monolithic)
End-to-end
methods do not
show drop in
performance
when long-term
memory is
disabled
Modular methods are more sensitive to noise in goal observations
LOCAL POLICY
PWPLANNER
(Bar charts; goal types: Object goal, Language goal, Image goal; conditions: Without noise, With noise)
DYNAMIC INSTANCE
MAPPING
GOAL LOCALIZATION
Builds explicit map of the
environment in combination
with path planning for
navigation
Modular GOAT
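To make "explicit map plus path planning" concrete, here is a minimal sketch that plans over a 2D occupancy grid. Breadth-first search is used purely as a simple stand-in. The poster does not say which planner Modular GOAT uses, and the grid encoding (0 = free, 1 = obstacle) and the function name `plan_path` are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column) in the occupancy grid


def plan_path(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on a grid where 0 is free and 1 is blocked.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parents: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path: List[Cell] = []
            current: Optional[Cell] = cell
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


# A wall across the middle row forces a detour through the right column.
example_grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
example_path = plan_path(example_grid, (0, 0), (2, 0))
```

Because the plan depends directly on where the goal is placed in the map, an error in goal localization sends the planner to the wrong cell, which is one plausible reading of why modular methods are reported as more sensitive to noisy goal observations.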
(Chart; methods: Modular GOAT, Skill Chain, Monolithic; legible values: 16.5, 11, 55, 25)
End-to-end trained methods are robust to noise in goal specification