Photograph of a research poster presented at the CVPR 2024 conference in Seattle, WA, held June 17-21. The poster section, titled "Motivation & Model Architecture," explains how to fuse point cloud data into Large Language Models (LLMs). The illustrated pipeline is broken into several stages:

1. **Point Self-Supervised Learning Backbone** and **Point-Text Pretraining:**
   - Uses a transformer point encoder initialized with PointBERT weights.
   - A Point Q-Former fuses the point and text modalities.
2. **Stage 1: Point-Text Feature Alignment:**
   - Aligns point and text features through the Point Encoder and Point Q-Former, using zero-shot point-text retrieval, point-text matching, and point-text generation objectives.
3. **Stage 2: Point Understanding and Generation:**
   - Integrates frozen LLMs, connected through a fully-connected projection, for understanding and generation tasks.

The **Contribution** section highlights the key innovations:

- **GPT4Point:** A unified framework for point-language understanding and generation.
- **Pyramid-XL:** An automated point-language dataset annotation engine.
- **Novel Object-Level Point Cloud Benchmark:** Comprehensive metrics for assessing models' understanding and for evaluating generated 3D objects.

A step-by-step model architecture diagram and description are laid out, along with a companion diagram showing a diffusion branch operating between the encoders. A hand is seen pointing toward the diffusion-branch section of the poster. A background skyline graphic and a QR code at the top add visual appeal and link attendees to additional resources.

Text transcribed from the image (legible portions only; scattered diagram labels and cut-off fragments from neighboring poster sections are omitted):

CVPR, Seattle, WA, June 17-21, 2024

**Motivation & Model Architecture.** How to fuse the point into LLMs? Point Self-Supervised Learning Backbone; Point-Text Pretraining; 3D Point-Text; 3D (Point) MLLMs.

**Stage 1: Point-Text Feature Alignment.** Zero-shot Point-Text Retrieval, Point-Text Matching, Point-Text Generation. Diagram labels include Point Encoder, Point Q-Former, queries Q₁…Qₙ, and text tokens T₁…Tₙ.

**Stage 2: Point Understanding and Generation.** Diagram labels include Point Encoder, Point Q-Former, Fully Connected, Large Language Model, Text Tokenizer, Text Tokens, Queries, Gradient, CLIP Encoder, CLS Token, Point-E, and Diffusion Branch. Example caption: "A bowl of objects sits on top of a table." Class labels shown include planet, airplane, and guitar.

"We use the weights of PointBERT as the point encoder backbone. We use Point Q-Former to fuse the point and text modalities. In stage 1, we first align the two modalities, and in stage 2, we integrate the frozen LLMs."

**Contribution.**
- We present GPT4Point, a unified framework for point-language understanding and generation, including a 3D MLLM for point-text tasks and controlled 3D generation.
- We introduce Pyramid-XL, an automated point-language dataset annotation engine based on Objaverse-XL, encompassing 1M pairs with varying levels of coarseness, extendable cost-effectively.
- We establish a novel object-level point cloud benchmark with comprehensive metrics to thoroughly assess models' understanding and evaluate generated 3D objects.
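To make the fusion step the poster describes more concrete, here is a minimal PyTorch sketch of a Q-Former-style module: learnable queries cross-attend to features from a frozen point encoder, and a fully-connected layer projects the resulting query embeddings into a frozen LLM's input space, as in Stage 2. This is an illustrative sketch, not the authors' implementation; all module names, dimensions, and the toy inputs are assumptions.

```python
# Hedged sketch of the poster's fusion idea: learnable queries cross-attend
# to point-encoder features (Q-Former), then a fully-connected projection
# maps query embeddings into a frozen LLM's token space (Stage 2).
# All names and sizes are hypothetical placeholders.
import torch
import torch.nn as nn


class PointQFormerSketch(nn.Module):
    """Q-Former-style fusion: learnable queries attend to point features."""

    def __init__(self, num_queries=32, dim=256, num_heads=8, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Stage 2: fully-connected projection into the (frozen) LLM's space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, point_feats):
        # point_feats: (batch, num_point_tokens, dim) from a frozen
        # PointBERT-style encoder.
        b = point_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.norm1(q + self.cross_attn(q, point_feats, point_feats)[0])
        q = self.norm2(q + self.ffn(q))
        # Output would be prepended to the LLM's text token embeddings.
        return self.to_llm(q)


# Toy usage: fake point-encoder output for a batch of 2 objects.
point_feats = torch.randn(2, 128, 256)
model = PointQFormerSketch()
llm_tokens = model(point_feats)
print(llm_tokens.shape)  # torch.Size([2, 32, 4096])
```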
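The Stage-1 objectives the poster lists (retrieval, matching, generation) can likewise be sketched. The retrieval/alignment part of such pipelines is commonly trained with a symmetric contrastive (InfoNCE) loss like the one below; the temperature, embedding shapes, and function name are assumptions for illustration, not the authors' settings.

```python
# Hedged sketch of a Stage-1-style point-text contrastive alignment loss.
import torch
import torch.nn.functional as F


def point_text_contrastive_loss(point_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: matched point/text pairs are positives."""
    point_emb = F.normalize(point_emb, dim=-1)   # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)     # (batch, dim)
    logits = point_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: point-to-text and text-to-point retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random embeddings for a batch of 4 pairs.
loss = point_text_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())
```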