In the image, there is a table filled with various items, including a sign, several cups, and a poster. The poster is displayed on a bulletin board alongside other posters and signs. The table is surrounded by people, and there is a cup of soda on it as well. The posters and signs on the bulletin board appear to be related to UCLA and may contain information about the university or events happening on campus.

Text transcribed from the image:

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan
UCLA / South China University of Technology / MIT-IBM Watson AI Lab / UMass Amherst

Motivation: Current multi-modal large language models passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the embodied 3D environment and dynamically collect their multisensory information. Moreover, they use binding methods to bind embeddings of other modalities to images, while multisensory data can be obtained only through interaction. (Figure residue: a diagram contrasting binding approaches such as ImageBind-LLM and Point-Bind across image, text, audio, point-cloud, and video inputs.)

The MultiPLY Framework: Overview of MultiPLY. We first encode the scene as an abstracted object-centric representation, while multisensory details of objects are unveiled only when the agent executes an action and interacts with them. We devise a set of action tokens denoting the actions of agents to interact with the environment. The interaction results are appended back to the LLM via state tokens.

Interaction with Multisensory Environment: the agent gathers visual, temperature, tactile, and ambient-sound observations through actions such as choosing or touching an object (e.g., the donut). Example exchange: "I heard the microwave beeping and plan to go toward it. <temperature> No, it is hard and cold."

Multisensory Interaction Data Collection (built on Concept Graphs):
Context (bounding box, material, temperature, hardness, ...): Room 1: CD player: [0.3, 0.3, 0.5], plastic, hot, hard, ...; Room 2: Donut: [0.2, 0.3, 0.1], dough, cold, hard, ...
Instruction (shortened version): You are an AI assistant / task generator in the room. You need to generate a task in the scene.
Demonstration: For Room 1: [few-shot example]. Generate similar responses for Room 2.
Response: For Room 2: Q: Is the donut ready to eat? input: Q + "I see a donut." output: <tactile> <temperature> ...

What MultiPLY Could Do: reason over temperature, visual appearance, impact sound, and tactile properties of objects.

Experimental Results: only table fragments are legible (column headers "Model" and "Sensor Set"; one row reading "ConceptGraph+CLAP, Language+Audio").
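
The framework overview above implies a generate-act-observe loop: the LLM emits an action token, the environment executes the action, and the observation is appended back to the context as state tokens. The following is a minimal Python sketch of such a loop; every name in it (ACTION_TOKENS, llm.generate, env.execute, encode_state) is a hypothetical stand-in for illustration, not MultiPLY's actual interface, and the token set is assumed.

ACTION_TOKENS = {"<select>", "<navigate>", "<touch>", "<hit>", "<observe>"}

def encode_state(obs: dict) -> str:
    # Serialize sensor readings into state tokens, e.g.
    # {"temperature": "cold", "tactile": "hard"} -> "<temperature> cold <tactile> hard"
    return " ".join(f"<{k}> {v}" for k, v in obs.items())

def interaction_loop(llm, env, instruction: str, scene_repr: str, max_steps: int = 8) -> str:
    # The LLM initially sees only the abstracted object-centric scene
    # representation; multisensory details are revealed step by step
    # as it acts on objects and receives state tokens back.
    context = [instruction, scene_repr]
    out = ""
    for _ in range(max_steps):
        out = llm.generate("\n".join(context))
        token = next((t for t in ACTION_TOKENS if out.endswith(t)), None)
        if token is None:
            return out                       # plain language: final answer
        obs = env.execute(token)             # e.g. touch the donut
        context += [out, encode_state(obs)]  # append result via state tokens
    return out

In this reading, ending a generation with an action token is what hands control to the environment, and the loop terminates when the model produces a purely linguistic answer.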
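
The data-collection prompt transcribed above follows a fixed Context / Instruction / Demonstration / Response layout. The helper below sketches how such a prompt could be assembled; build_task_prompt and the per-object record format are assumptions made for illustration, with only the field names and example values taken from the poster.

def build_task_prompt(rooms: dict, demo_room: str, demo: str, target_room: str) -> str:
    # rooms maps a room name to records of the form
    # (object name, bounding box, material, temperature, hardness).
    context_lines = []
    for room, objects in rooms.items():
        objs = "; ".join(
            f"{name}: {bbox}, {material}, {temp}, {hardness}"
            for name, bbox, material, temp, hardness in objects
        )
        context_lines.append(f"{room}: {objs}")
    return "\n".join([
        "Context (bounding box, material, temperature, hardness, ...):",
        *context_lines,
        "Instruction: You are an AI assistant / task generator in the room. "
        "You need to generate a task in the scene.",
        f"Demonstration: For {demo_room}: {demo}",
        f"Generate similar responses for {target_room}.",
    ])

rooms = {
    "Room 1": [("CD player", [0.3, 0.3, 0.5], "plastic", "hot", "hard")],
    "Room 2": [("Donut", [0.2, 0.3, 0.1], "dough", "cold", "hard")],
}
print(build_task_prompt(rooms, "Room 1", "[few-shot example]", "Room 2"))

Run as a script, this prints a prompt mirroring the poster's example, asking the model to produce a Room 2 task (e.g., "Is the donut ready to eat?") in the style of the Room 1 demonstration.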