A detailed caption for this image could be: "Participants examine a scientific poster titled 'GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation' presented at the CVPR conference held June 17-21, 2024. The poster lists authors from several institutions, including the Georgia Institute of Technology, Carnegie Mellon University, the University of Illinois, and the University of Washington. It describes the Go to Anything (GOAT) task, in which an embodied agent navigates indoor environments to a sequence of open-vocabulary goals, each specified as an object category, a language description, or an image. The poster covers the dataset, the baseline models, and charts reporting success rate and SPL, navigation efficiency across successive sub-tasks, the effect of disabling memory, and robustness to noise in the goal specification. One of the participants points to the 'Task' section of the poster, which introduces the navigation task and is illustrated with diagrams and example goals."
Text transcribed from the image:
Logos: Carnegie Mellon University (other logos illegible)
Title: GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
Authors: Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi
Section headings: Task, Dataset, Results
Task: Go to Anything (GOAT). Embodied agent is tasked with navigating to a sequence of open-vocabulary goals specified via category name, language description or an image.
Example goals: Brick fireplace with [illegible] screen TV above. The fireplace is located below the speaker. Double oven located near the kitchen counter, kitchen cabinet, sink, and blinds.
Example goals: Denim jacket located [illegible] the purse. Grey couch located on the left side of the room next to the picture. The bookshelf that is located above the box and next to the container; it is red and has a lot of books on it. White towel located near the sink in the bathroom cabinet and mirror region. [illegible] in the room where there is a stack of jackets hanging on it. Countertop located on a cabinet.
Goal types shown: Object goal, Language goal, Image goal. Figure caption: Examples of GOAT-Bench goals.
Dataset: Leverage simulator metadata and large language models (LLM) to generate language goals.
Language goal generation pipeline (diagram labels, in order): Habitat Simulator; selecting best viewpoint; bbox info; "Describe the bed"; BLIP V2; "a large bed with a floral comforter"; + prompt; "Find the bed with a floral comforter and a pillow in the middle."
Key Features:
Open-Vocabulary Multi-modal Goals: Specified as image, language or object category. Tests generalization to seen and unseen objects in unseen scenes.
Lifelong Navigation: In each episode the agent is tasked with navigating to 5-10 goals in the same scene.
Reproducible: Comprehensive benchmarking of existing methods by leveraging simulation.
Baselines:
SenseAct-NN Skill-chain: Maps sensors to actions using a separate end-to-end trained CNN+RNN policy for each modality. Diagram labels: RGB, CroCo, BERT, CLIP, OBJ, LANG, GOAL.
SenseAct-NN Monolithic: Maps sensors to actions using a single end-to-end trained CNN+RNN policy. Diagram labels: OBJ, LANG, CLIP, GOAL, ACTION.
Modular GOAT: Builds explicit map of the environment in combination with path planning for navigation. Diagram labels: object detection, projection, top-down semantic map, CLIP feature, keypoint matching, instance map, dynamic instance mapping, goal localization, local policy, planner, action.
CVPR, June 17-21, 2024
Results:
Skill chaining achieves SOTA on success rate. Modular GOAT achieves SOTA on SPL. [Bar charts comparing Modular GOAT, SenseAct-NN Skill Chain, and SenseAct-NN Monolithic; annotations of +6.6% and +4.0% are visible, other values are not legible.]
Efficiency of SenseAct-NN and Modular method improves over time. Efficiency of navigation improves for both modular and end-to-end trained methods. [Line plots for Modular GOAT and SenseAct-NN Monolithic; x-axis: number of sub-tasks.]
End-to-end methods do not show drop in performance when long-term memory is disabled. [Charts for Modular GOAT and SenseAct-NN Monolithic, with memory vs. without memory.]
Modular methods are more sensitive to noise in goal observations. End-to-end trained methods are robust to noise in goal specification. [Charts by goal type (object goal, language goal, image goal), with noise vs. without noise.]
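The poster reports results as success rate and SPL without defining them. As a rough illustration, the sketch below computes both over the sub-tasks of one episode, using the standard definition of SPL (Success weighted by Path Length): each successful sub-task contributes the ratio of the shortest-path length to the longer of the shortest path and the path the agent actually took. The `SubtaskResult` record, its field names, and the choice to score each sub-task independently are assumptions made for this example; the poster does not confirm how GOAT-Bench aggregates its metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtaskResult:
    """Outcome of navigating to one goal (hypothetical record, not from the poster)."""
    modality: str         # "object", "language", or "image"
    success: bool         # whether the agent stopped at the goal
    shortest_path: float  # shortest-path distance from start to goal, in metres
    agent_path: float     # distance the agent actually travelled, in metres


def success_rate(results: List[SubtaskResult]) -> float:
    """Fraction of sub-tasks that ended in success."""
    return sum(r.success for r in results) / len(results)


def spl(results: List[SubtaskResult]) -> float:
    """Success weighted by Path Length, averaged over sub-tasks."""
    total = 0.0
    for r in results:
        if not r.success:
            continue  # failed sub-tasks contribute zero
        longest = max(r.agent_path, r.shortest_path)
        # Guard against a zero-length sub-task (agent already at the goal).
        total += r.shortest_path / longest if longest > 0 else 1.0
    return total / len(results)


# Illustrative episode with made-up numbers, one sub-task per goal modality.
episode = [
    SubtaskResult("object", True, shortest_path=5.0, agent_path=10.0),
    SubtaskResult("language", False, shortest_path=4.0, agent_path=12.0),
    SubtaskResult("image", True, shortest_path=3.0, agent_path=3.0),
]

print(f"success rate: {success_rate(episode):.3f}")
print(f"SPL: {spl(episode):.3f}")
```

Because SPL penalises detours while success rate does not, one method can lead on success rate and another on SPL, which is the pattern the poster's first two result captions describe.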