A person stands in front of a large poster display and points at it, likely giving a presentation or sharing information with others. The board is covered with text, diagrams, and graphs. Text transcribed from the image:

Carnegie Mellon University
CVPR, June 17-21, 2024

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi

Task
Go to Anything (GOAT): Embodied agent is tasked with navigating to a sequence of open-vocabulary goals specified via category name, language description, or an image.

Dataset
Leverage simulator metadata and large language models (LLM) to generate language goals.
[Language goal generation pipeline diagram; legible labels: Habitat Simulator, Selecting best viewpoint, BBox info, BLIP V2, +prompt]
Pipeline example: "Describe the bed" -> BLIP V2 -> "a large bed with a floral comforter" -> +prompt -> "Find the bed with a floral comforter and a pillow in the middle."

Examples of GOAT-Bench goals (object goal, language goal, image goal):
Brick fireplace with flat screen TV above. The fireplace is located below the speaker.
Double oven located near the kitchen counter, kitchen cabinet, sink, and blinds.
Denim jacket located [illegible] the purse.
Grey couch located on the left side of the room next to the picture and the bookshelf that is located above the box and next to the container; it is red and has a lot of books on it.
White towel located near the sink in the bathroom cabinet and mirror region.
[illegible] in the room where there is a stack of jackets hanging on it.
Countertop located on a cabinet.

Key Features
Open-Vocabulary Multi-modal Goals: Specified as image, language, or object category. Tests generalization to seen and unseen objects in unseen scenes.
Lifelong Navigation: In each episode the agent is tasked with navigating to 5-10 goals in the same scene.
Reproducible: Comprehensive benchmarking of existing methods by leveraging simulation.

Baselines
SenseAct-NN Skill Chain: Maps sensors to actions using a separate end-to-end trained CNN+RNN policy for each modality.
SenseAct-NN Monolithic: Maps sensors to actions using a single end-to-end trained CNN+RNN policy.
Modular GOAT: Builds an explicit map of the environment in combination with path planning for navigation.
[Baseline architecture diagrams; legible labels: RGB, OBJ, LANG, Goal, Croco, BERT, CLIP, Action, Object detection, Projection, Top-down semantic map, CLIP feature, Keypoint matching, Instance map, Dynamic instance mapping, Goal localization, Local policy, Planner]

Results
Skill chaining achieves SOTA on success rate.
Modular GOAT achieves SOTA on SPL.
[Bar charts comparing Modular GOAT and the SenseAct-NN baselines; legible annotations: +6.6%, +4.0%]

Efficiency of SenseAct-NN and Modular method improves over time.
[Line plot of Modular GOAT and SenseAct-NN Monolithic; x-axis: Number of sub-tasks]
Efficiency of navigation improves for both modular and end-to-end trained methods.

[Plot of Modular GOAT and SenseAct-NN Monolithic with memory and without memory; x-axis: Number of sub-tasks]
End-to-end methods do not show a drop in performance when long-term memory is disabled.

Modular methods are more sensitive to noise in goal observations.
[Bar charts for object goal, language goal, and image goal, with noise and without noise, for the modular and end-to-end methods]
End-to-end trained methods are robust to noise in goal specification.