The image shows a group of people gathered around a large white board that carries a long panel of information. The board presents its information visually and, judging by the text transcribed below, is most likely a research poster. Everyone in the group is looking at the board; some are holding backpacks, and at least one person is wearing glasses. The setting is likely a study or meeting room, with the board as the central point of focus.
Text transcribed from the image:
Unseen: Visual Common Sense for Semantic Placement
Samrakhya¹, Aniruddha Kembhavi², Dhruv Batra¹, Zsolt Kira¹, Kuo-Hao Zeng²*, Luca Weihs²*
LAION-400M
How to learn Semantic Placement?
Key Idea: Use synthetically generated real world and simulation data
Inpainting real images
Synthetic Data
Habitat Simulator
State before object placement
Remove object using inpainting
Use original detections as labels
Automatic data generation pipeline
Prompt: 4k, HD
Sample a pair of objects
Inpainted Image
Pass
Stable Diffusion
LAMA
Detic
Detic & SAM
Filter
Stable Diffusion
SDEdit
1-p
Fail
Discard
1M images
distractor objects
5% noise
Images
(B) Find Objects of Interest
(C) Inpaint Objects of Interest
(D) Filter
Model architecture
CLIP-ResNet50
7x7x2048
1024
CLIP-TextEncoder
Cushion
256
1024
Target: Cushion
Sensor Pose
Re
Embodie
Given an
with placing
meaningful l
Embodied Sema
2x Augmented Images
(E) Enhance Image Quality
Observations (O)
RGBD
Target Category
(ex: "cushion")
Observ
Mask Prediction M
Preference
LLM+Detect
LLaVA
GPT4V
CLIP-UNet (
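
The pipeline steps (B) to (E) transcribed above can be read as a simple data flow: find objects, remove each one by inpainting, filter the result, enhance the image, and keep the original detection as the label. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation. All names here (Detection, Sample, generate_samples, and the detect, inpaint, passes_filter, and enhance callables) are hypothetical stand-ins for the tools named on the poster (Detic and SAM, LAMA, Stable Diffusion with SDEdit), and the filter criterion is left as a generic predicate because the transcription does not state it.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; an assumed convention


@dataclass
class Detection:
    """One detected object of interest (hypothetical structure)."""
    category: str
    box: Box


@dataclass
class Sample:
    """One training example: an image without the object, plus where it was."""
    image: Any             # inpainted (and enhanced) image
    target_category: str   # e.g. "cushion"
    label_box: Box         # original detection reused as the placement label


def generate_samples(
    image: Any,
    detect: Callable[[Any], List[Detection]],      # stand-in for step (B)
    inpaint: Callable[[Any, Detection], Any],      # stand-in for step (C)
    passes_filter: Callable[[Any, Detection], bool],  # stand-in for step (D)
    enhance: Callable[[Any], Any],                 # stand-in for step (E)
) -> List[Sample]:
    """Run the assumed (B) -> (C) -> (D) -> (E) flow on a single image."""
    samples: List[Sample] = []
    for det in detect(image):
        removed = inpaint(image, det)          # remove the object using inpainting
        if not passes_filter(removed, det):    # Fail -> Discard
            continue
        samples.append(
            Sample(
                image=enhance(removed),        # enhance image quality
                target_category=det.category,
                label_box=det.box,             # use original detection as label
            )
        )
    return samples
```

Passing the four stages in as callables keeps the sketch independent of any particular detector or inpainting library, so each stage can be swapped or stubbed out for testing.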