In the image, there is a table filled with various items, including a sign, several cups, and a poster. The poster is displayed on a bulletin board alongside other posters and signs. The table is surrounded by people, and there is a cup of soda on it as well. The posters and signs on the bulletin board appear to be related to UCLA and may contain information about the university or events happening on campus.

Text transcribed from the image:

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan
UCLA / South China University of Technology / MIT-IBM Watson AI Lab / UMass Amherst

Motivation: Current multi-modal large language models passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the embodied 3D environment and dynamically collect their multisensory information. Moreover, they use binding methods to bind embeddings of other modalities to images, while multisensory data can be obtained only through interaction. (Figure residue: a diagram contrasting binding approaches such as ImageBind-LLM and Point-Bind across image, text, audio, point-cloud, and video inputs.)

The MultiPLY Framework: Overview of MultiPLY. We first encode the scene as an abstracted object-centric representation, while multisensory details of objects are unveiled only when the agent executes an action and interacts with them. We devise a set of action tokens denoting the actions of agents to interact with the environment. The interaction results are appended back to the LLM via state tokens.

Interaction with Multisensory Environment: the agent gathers visual, temperature, tactile, and ambient-sound observations through actions such as choosing or touching an object (e.g., the donut). Example exchange: "I heard the microwave beeping and plan to go toward it. <temperature> No, it is hard and cold."

Multisensory Interaction Data Collection (built on Concept Graphs):
Context (bounding box, material, temperature, hardness, ...): Room 1: CD player: [0.3, 0.3, 0.5], plastic, hot, hard, ...; Room 2: Donut: [0.2, 0.3, 0.1], dough, cold, hard, ...
Instruction (shortened version): You are an AI assistant / task generator in the room. You need to generate a task in the scene.
Demonstration: For Room 1: [few-shot example]. Generate similar responses for Room 2.
Response: For Room 2: Q: Is the donut ready to eat? input: Q + "I see a donut." output: <tactile> <temperature> ...

What MultiPLY Could Do: reason over temperature, visual appearance, impact sound, and tactile properties of objects.

Experimental Results: only table fragments are legible (column headers "Model" and "Sensor Set"; one row reading "ConceptGraph+CLAP, Language+Audio").
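
The framework overview above implies a generate-act-observe loop: the LLM emits an action token, the environment executes the action, and the observation is appended back to the context as state tokens. The following is a minimal Python sketch of such a loop; every name in it (ACTION_TOKENS, llm.generate, env.execute, encode_state) is a hypothetical stand-in for illustration, not MultiPLY's actual interface, and the token set is assumed.

ACTION_TOKENS = {"<select>", "<navigate>", "<touch>", "<hit>", "<observe>"}

def encode_state(obs: dict) -> str:
    # Serialize sensor readings into state tokens, e.g.
    # {"temperature": "cold", "tactile": "hard"} -> "<temperature> cold <tactile> hard"
    return " ".join(f"<{k}> {v}" for k, v in obs.items())

def interaction_loop(llm, env, instruction: str, scene_repr: str, max_steps: int = 8) -> str:
    # The LLM initially sees only the abstracted object-centric scene
    # representation; multisensory details are revealed step by step
    # as it acts on objects and receives state tokens back.
    context = [instruction, scene_repr]
    out = ""
    for _ in range(max_steps):
        out = llm.generate("\n".join(context))
        token = next((t for t in ACTION_TOKENS if out.endswith(t)), None)
        if token is None:
            return out                       # plain language: final answer
        obs = env.execute(token)             # e.g. touch the donut
        context += [out, encode_state(obs)]  # append result via state tokens
    return out

In this reading, ending a generation with an action token is what hands control to the environment, and the loop terminates when the model produces a purely linguistic answer.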
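
The data-collection prompt transcribed above follows a fixed Context / Instruction / Demonstration / Response layout. The helper below sketches how such a prompt could be assembled; build_task_prompt and the per-object record format are assumptions made for illustration, with only the field names and example values taken from the poster.

def build_task_prompt(rooms: dict, demo_room: str, demo: str, target_room: str) -> str:
    # rooms maps a room name to records of the form
    # (object name, bounding box, material, temperature, hardness).
    context_lines = []
    for room, objects in rooms.items():
        objs = "; ".join(
            f"{name}: {bbox}, {material}, {temp}, {hardness}"
            for name, bbox, material, temp, hardness in objects
        )
        context_lines.append(f"{room}: {objs}")
    return "\n".join([
        "Context (bounding box, material, temperature, hardness, ...):",
        *context_lines,
        "Instruction: You are an AI assistant / task generator in the room. "
        "You need to generate a task in the scene.",
        f"Demonstration: For {demo_room}: {demo}",
        f"Generate similar responses for {target_room}.",
    ])

rooms = {
    "Room 1": [("CD player", [0.3, 0.3, 0.5], "plastic", "hot", "hard")],
    "Room 2": [("Donut", [0.2, 0.3, 0.1], "dough", "cold", "hard")],
}
print(build_task_prompt(rooms, "Room 1", "[few-shot example]", "Room 2"))

Run as a script, this prints a prompt mirroring the poster's example, asking the model to produce a Room 2 task (e.g., "Is the donut ready to eat?") in the style of the Room 1 demonstration.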