The image shows a person's hand resting on, and pointing at, a large research poster mounted on a display wall, with other posters visible around it. The poster is a CVPR 2024 presentation (Seattle, WA, June 17-21, 2024), laid out to walk viewers through the method's motivation, model architecture, and contributions, combining block diagrams, example captions, and a bulleted contribution list. It is intended to convey the work to conference attendees in an informative and accessible way.

Text transcribed from the image:

CVPR, Seattle, WA, June 17-21, 2024

Motivation & Model Architecture
How to fuse the point into LLMs? Point Self-supervised Learning Backbone; Point-Text Pretraining; 3D (Point) MLLMs.

Stage 1: Point-Text Feature Alignment
Point Encoder, Point Q-Former, and Large Language Model, trained with point-text objectives: zero-shot point-text retrieval, point-text matching, and point-text generation. [Diagram: point tokens and learnable queries Q1…QN paired with texts T1…TN inside a Transformer.]

Stage 2: Point Understanding and Generation
Point Encoder → Point Q-Former → Fully Connected → Large Language Model, producing class labels (planet, airplane, guitar) and captions such as "A bowl of objects sits on top of a table."; a Diffusion Branch (Point-E) is shown for generation. [Other diagram labels: Text Tokenizer, Text Tokens, Queries, Gradient, CLIP Encoder, CLS Token.]

We use the weights of PointBERT as the point encoder backbone and the Point Q-Former to fuse the point and text modalities. In stage 1, we first align the two modalities; in stage 2, we integrate the frozen LLMs. (Minimal code sketches of both stages are given after the transcription.)

Contribution
- We present GPT4Point, a unified framework for point-language understanding and generation, including a 3D MLLM for point-text tasks and controlled 3D generation.
- We introduce Pyramid-XL, an automated point-language dataset annotation engine based on Objaverse-XL, encompassing 1M pairs with varying levels of coarseness and extendable cost-effectively.
- We establish a novel object-level point cloud benchmark with comprehensive metrics to thoroughly assess models' understanding and to evaluate generated 3D objects.

[The remaining panels, covering 3D understanding results and Pyramid-XL annotation quality levels, are only partially legible in the transcription.]
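The Stage-1 objectives listed on the poster (point-text retrieval, matching, and generation) are alignment losses between point and text features. The sketch below illustrates only the contrastive/retrieval part under stated assumptions: the function name, temperature, and pooled feature shapes are illustrative and not taken from the authors' released code.

```python
import torch
import torch.nn.functional as F

def point_text_contrastive_loss(point_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss between pooled point-query features and text
    features, as a stand-in for the poster's Stage-1 point-text retrieval
    objective. Both inputs have shape (B, D) for B matched point-text pairs."""
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feats @ text_feats.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average point-to-text and text-to-point cross-entropy over the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In practice this loss would be combined with the matching and generation objectives named on the poster; only the retrieval term is sketched here.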
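The architecture paragraph describes a frozen PointBERT-style point encoder, a Point Q-Former that fuses point and text modalities, a fully connected projection, and a frozen LLM for Stage 2. The following is a minimal PyTorch sketch of that wiring, assuming BLIP-2-style learnable queries; the class names, dimensions, and the HuggingFace-style `inputs_embeds` call are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PointQFormer(nn.Module):
    """Learnable queries cross-attend to point tokens (BLIP-2-style Q-Former)."""
    def __init__(self, dim=384, num_queries=32, num_layers=4, num_heads=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, point_tokens):
        # Queries form the target sequence; point tokens are the memory that the
        # decoder layers cross-attend to, yielding a fixed-length fused output.
        q = self.queries.expand(point_tokens.size(0), -1, -1)
        return self.decoder(q, point_tokens)                  # (B, num_queries, dim)

class GPT4PointStage2Sketch(nn.Module):
    """Frozen point encoder -> Point Q-Former -> FC projection -> frozen LLM."""
    def __init__(self, point_encoder, llm, point_dim=384, llm_dim=4096):
        super().__init__()
        self.point_encoder = point_encoder        # e.g. a PointBERT backbone (frozen)
        self.qformer = PointQFormer(dim=point_dim)
        self.fc = nn.Linear(point_dim, llm_dim)   # the "Fully Connected" block on the poster
        self.llm = llm                            # frozen LLM; only the Q-Former and FC train
        for p in self.point_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, points, text_embeds):
        point_tokens = self.point_encoder(points)             # (B, N, point_dim)
        fused = self.qformer(point_tokens)                    # (B, Q, point_dim)
        prefix = self.fc(fused)                               # (B, Q, llm_dim)
        # Prepend the projected point queries to the text embeddings so the
        # frozen LLM conditions its caption generation on the point cloud.
        inputs = torch.cat([prefix, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)                 # HF-style call, assumed
```

The key design point the poster highlights is that both heavy components (point encoder and LLM) stay frozen; only the Q-Former and the fully connected projection are trained to bridge the two modalities.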