Reason3D

How do LMMs handle 3D segmentation?

Recent advances in large multimodal models (LMMs) have shown their potential in various domains, especially concept reasoning. Despite these developments, applications in understanding 3D environments remain limited: existing models primarily offer textual or numerical outputs and cannot generate dense, informative segmentation masks.

This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D takes point cloud data and text prompts as input and produces textual responses and segmentation masks, enabling advanced tasks such as 3D reasoning segmentation, hierarchical searching, expression referring, and question answering with detailed mask outputs.

Problem Setting: 3D Reasoning Segmentation

We introduce a new problem setting, 3D reasoning segmentation, in which an LMM must segment the objects implied by a query. We show two examples of 3D reasoning segmentation that require in-depth world knowledge and reasoning.

Proposed Framework: Reason3D

We first use a point encoder to extract dense features from the input scene, then apply a superpoint pooling layer to reduce complexity. An interactor merges the superpoint features with a learnable query, and the result is fed into a frozen LLM together with the instruction to generate an output containing two critical tokens, [LOC] and [SEG]. A hierarchical decoder then uses the [LOC] embedding to estimate a coarse location that likely covers the object. Finally, this estimated location is combined with the [SEG] embedding to predict the final segmentation mask.
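The superpoint pooling and coarse-to-fine decoding steps can be illustrated with a small NumPy sketch. It is not the Reason3D implementation: the function names (`superpoint_pool`, `coarse_to_fine_mask`), the mean-pooling choice, the sphere-shaped coarse region, and the dot-product scoring are all assumptions made for illustration, and the `loc_center`, `loc_radius`, and `seg_embed` values stand in for quantities that the real model would derive from the [LOC] and [SEG] embeddings with learned decoders.

```python
import numpy as np


def superpoint_pool(point_values, superpoint_ids):
    """Average per-point values within each superpoint (assumed mean pooling)."""
    num_sp = int(superpoint_ids.max()) + 1
    sums = np.zeros((num_sp, point_values.shape[1]), dtype=np.float64)
    # Unbuffered scatter-add so repeated superpoint ids accumulate correctly.
    np.add.at(sums, superpoint_ids, point_values)
    counts = np.bincount(superpoint_ids, minlength=num_sp).astype(np.float64)
    return sums / np.maximum(counts, 1.0)[:, None]


def coarse_to_fine_mask(sp_feats, sp_centers, loc_center, loc_radius, seg_embed):
    """Hypothetical two-stage mask over superpoints.

    Coarse stage: keep superpoints inside a sphere around the estimated location.
    Fine stage: keep those whose feature has a positive dot product with seg_embed.
    """
    in_region = np.linalg.norm(sp_centers - loc_center, axis=1) <= loc_radius
    scores = sp_feats @ seg_embed
    return in_region & (scores > 0.0)


# Toy scene: 5 points, 2-D features, grouped into 3 superpoints.
point_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, -2.0], [0.0, -4.0], [2.0, 2.0]])
point_xyz = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [1.0, 0.0, 0.0],
                      [1.2, 0.0, 0.0], [5.0, 5.0, 5.0]])
superpoint_ids = np.array([0, 0, 1, 1, 2])

sp_feats = superpoint_pool(point_feats, superpoint_ids)
sp_centers = superpoint_pool(point_xyz, superpoint_ids)

# Stand-ins for what the decoder would produce from [LOC] and [SEG].
loc_center = np.array([0.5, 0.0, 0.0])
loc_radius = 1.0
seg_embed = np.array([1.0, 0.0])

sp_mask = coarse_to_fine_mask(sp_feats, sp_centers, loc_center, loc_radius, seg_embed)
# Broadcast the superpoint-level mask back to individual points.
point_mask = sp_mask[superpoint_ids]
print(point_mask)
```

In this toy scene the third superpoint matches the [SEG] stand-in but lies outside the coarse region, so it is excluded. This shows how a coarse location estimate can suppress distant look-alike objects before the fine mask is predicted.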

Visualization Results: Reasoning 3D Segmentation

BibTeX

@inproceedings{reason3d,
  title={Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model},
  author={Kuan-Chih Huang and Xiangtai Li and Lu Qi and Shuicheng Yan and Ming-Hsuan Yang},
  booktitle={International Conference on 3D Vision (3DV)},
  year={2025}
}