PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

1Xi'an Jiaotong University, 2Tsinghua University, 3Shanghai Jiao Tong University, 4Zhejiang University, 5Nanyang Technological University, 6Harbin Institute of Technology, Shenzhen, 7Institute of Automation, CASIA, 8Taiyuan University of Technology, 9Shanghai AI Laboratory
*Equal contribution
PointCoT: Explicit Look-Think-Answer Paradigm vs. Geometric Hallucination

PointCoT introduces an explicit Look–Think–Answer paradigm that empowers Multimodal Large Language Models with geometry-grounded Chain-of-Thought reasoning over 3D point clouds, significantly reducing geometric hallucinations and enabling interpretable 3D understanding.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models, yet they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations: confidently generating plausible responses that are not grounded in precise structural details.

To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate a Look, Think, then Answer paradigm, in which the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising ~86k instruction-tuning samples with hierarchical CoT annotations. Leveraging a dual-stream multi-modal architecture, our method fuses semantic appearance cues from multi-view renders with geometric structure from raw point clouds. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.

Motivation

Pioneering works like PointLLM and 3D-LLM have successfully projected 3D point cloud features into the LLM input space, enabling basic 3D Question Answering. Despite these advancements, a critical limitation remains overlooked: most existing 3D-LLMs treat geometric reasoning as a black-box mapping process, training models to map input point clouds directly to final answers end-to-end while bypassing explicit reasoning steps.

When facing complex spatial tasks—such as judging whether a chair with a missing leg is stable—these models suffer from what we term Geometric Hallucination (illustrated above). A conventional model might correctly identify the semantic category but fail to ground its judgment in fine-grained geometric details (e.g., the missing leg), leading to plausible but factually incorrect conclusions.

This problem is rooted in two key challenges:

  • Data Scarcity. Existing 3D benchmarks offer simple pair-wise annotations and lack the explicit rationale supervision needed to train an interpretable reasoning process.
  • Modality Gap. Unlike rich 2D images, point clouds are sparse and lack semantic texture, making pure geometric reasoning difficult, while rendered images suffer from depth ambiguity.
👁 Look
Perceive fine-grained geometry. The model actively selects decisive viewpoints and scans specific structural regions of the 3D point cloud and multi-view renders.
💭 Think
Derive an explicit rationale grounded in spatial evidence. Intermediate reasoning tokens are anchored back to invariant geometric features via a contrastive InfoNCE loss.
✅ Answer
Deduce the final conclusion conditioned on both the geometric manifold and the verified reasoning chain, yielding interpretable and hallucination-resistant predictions.

Method

PointCoT operationalizes the Look–Think–Answer paradigm as a three-stage pipeline built around a tri-modal contextualization architecture:

Stage 1 — Look: Tri-Modal Contextualization

The alignment function Ealign integrates three modalities into a unified geometric-semantic manifold z. A Point Encoder (PointBERT) extracts high-dimensional geometric representations and absolute 3D spatial centroids from the raw point cloud. Concurrently, a Vision Encoder (EVA-CLIP ViT-g/14) embeds 8 spherical multi-view images into a visual semantic space. A novel Geometry-Guided Cross-Modal Attention (GCMA) module then synchronizes geometric and visual tokens by modulating their interaction with physical camera projection priors and a learnable spatial bandwidth constraint, augmented by Fourier-based relative spatial embeddings to capture high-frequency morphological variations. A dynamic occlusion-aware gate further mitigates 3D-to-2D projection artifacts. The resulting fused tokens are concatenated with tokenized text instructions to form the tri-modal manifold z.
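The geometry-guided modulation at the heart of GCMA can be sketched as cross-attention whose logits carry a Gaussian spatial prior. The following is a minimal NumPy illustration, not the released implementation: the back-projected view-token positions (`view_xyz`), the single-head formulation, and the fixed `bandwidth` are stand-ins for the paper's camera projection priors, learnable bandwidth constraint, occlusion gate, and Fourier embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gcma_attention(point_tokens, view_tokens, point_xyz, view_xyz, bandwidth=0.5):
    """Simplified geometry-guided cross-modal attention.

    Point tokens act as queries over multi-view tokens (keys/values);
    attention logits are biased by a Gaussian prior on the 3D distance
    between each point centroid and each view token's assumed position,
    so spatially distant views are exponentially down-weighted.
    """
    d = point_tokens.shape[-1]
    logits = point_tokens @ view_tokens.T / np.sqrt(d)            # (P, V)
    dist2 = ((point_xyz[:, None, :] - view_xyz[None, :, :]) ** 2).sum(-1)
    prior = -dist2 / (2.0 * bandwidth ** 2)                       # spatial bias
    attn = softmax(logits + prior, axis=-1)                       # (P, V)
    return attn @ view_tokens                                     # fused tokens
```

The Gaussian bias is the illustrative analogue of the learnable spatial bandwidth: shrinking `bandwidth` makes each point attend only to views that are geometrically close to it.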

Stage 2 — Think: Explicit Geometry-Grounded CoT Generation

The LLM autoregressively generates an explicit rationale chain R = {rt} conditioned on the manifold z. To prevent spatial hallucinations, a geometric attention anchor is enforced: at each decoding step t, the hidden state ht is regularized via a contrastive InfoNCE loss (Lanchor) that maximizes mutual information between the evolving reasoning state and the invariant geometric tokens. This theoretical grounding ensures that every generated reasoning token is conditioned on physical 3D truths rather than spurious 2D visual priors.
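The anchor objective can be sketched as standard InfoNCE over cosine similarities. One assumption for illustration: decoding step t is paired with geometric token t as its positive; the paper states only that mutual information between the reasoning state and the geometric tokens is maximized.

```python
import numpy as np

def info_nce_anchor(hidden, geo_tokens, temperature=0.07):
    """Contrastive anchor loss L_anchor (sketch): each reasoning hidden
    state is pulled toward its paired geometric token (diagonal) and
    pushed away from the remaining tokens (off-diagonal negatives)."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    g = geo_tokens / np.linalg.norm(geo_tokens, axis=-1, keepdims=True)
    sim = h @ g.T / temperature                      # (T, G) scaled cosines
    m = sim.max(axis=-1, keepdims=True)              # stable log-softmax
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(-1, keepdims=True)))
    return -np.mean(np.diag(log_prob))               # positives on diagonal
```

When the hidden states already match their geometric tokens the loss is near zero; any drift toward unanchored states raises it, which is the self-correcting behavior described above.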

Stage 3 — Answer: Progressive Dual-Stage Optimization

Training follows a two-stage curriculum. In Stage I (Reasoning Initialization), the model is trained solely on rationale generation with the answer prediction gradient truncated, forcing the acquisition of explicit geometric reasoning and preventing shortcut learning. In Stage II (Causal Deduction Tuning), the full joint objective is optimized using teacher forcing with the ground-truth reasoning trajectory as context. The geometric anchor loss Lanchor acts as a self-correcting mechanism throughout, bounding the autoregressive search space to ensure robust deductive resolution.
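At the loss level, the curriculum reduces to a switch over which terms contribute gradient. The weighting `lambda_anchor` is an assumed coefficient for illustration; the paper does not publish its value.

```python
def curriculum_loss(rationale_nll, answer_nll, anchor_loss,
                    stage, lambda_anchor=0.1):
    """Progressive dual-stage objective (sketch).

    Stage I (Reasoning Initialization): the answer-prediction gradient is
    truncated, so only rationale generation and the geometric anchor are
    optimized. Stage II (Causal Deduction Tuning): the full joint
    objective is optimized with teacher-forced rationales as context.
    """
    if stage == 1:
        return rationale_nll + lambda_anchor * anchor_loss
    return rationale_nll + answer_nll + lambda_anchor * anchor_loss
```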

PointCoT Architecture
Figure 3. The overall architecture of PointCoT. In the Look Stage, a dual-stream encoder (PointBERT + EVA-CLIP) extracts geometric and visual features, fused into a Tri-Modal Manifold z via GCMA. During the Think Stage, the LLM autoregressively generates an explicit rationale R, while hidden states are grounded to geometric tokens via an InfoNCE loss Lanchor. In the Answer Stage, the final answer is deduced conditionally on both z and R through progressive dual-stage optimization.

Point-Reason-Instruct Dataset

To democratize research on explicit 3D CoT reasoning, we construct Point-Reason-Instruct—the first large-scale dataset combining 3D point clouds with explicit CoT annotations. Unlike previous benchmarks that provide simple input-output pairs, each sample in our dataset provides a triplet of ⟨Point Cloud, Multi-view Images, CoT Rationale⟩, enabling models to learn how to reason, not just what to answer. The dataset is publicly available on HuggingFace.

Dataset statistics:

  • 86,280 total instruction-tuning samples
  • 28,760 unique 3D objects (Strict Object-Level Split)
  • 3 cognitive reasoning levels
  • 8 spherical views per object (6 azimuth + zenith + nadir)

Construction Pipeline

Data Sourcing. We curate objects from the Objaverse-LVIS subset using a topology-aware filtering protocol that prioritizes objects with complex geometries and articulated parts while filtering out overly simplistic shapes. For each object, point clouds are generated by sampling N = 8,192 points via Farthest Point Sampling (FPS).
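FPS here is the standard greedy algorithm: repeatedly add the point farthest from everything already selected, which yields uniform geometric coverage. A minimal NumPy sketch (an O(N·M) loop; production pipelines typically use a CUDA kernel):

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest point sampling over an (N, 3) array.

    Starts from a random point, then at each step picks the point whose
    distance to the nearest already-chosen point is largest.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    min_d = np.full(n, np.inf)               # distance to nearest chosen point
    for _ in range(n_samples - 1):
        d = ((points - points[chosen[-1]]) ** 2).sum(-1)
        min_d = np.minimum(min_d, d)
        chosen.append(int(min_d.argmax()))
    return points[chosen]
```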

Holistic Multi-View Rendering. We design a Spherical 8-View System for holistic coverage: six equidistant azimuth angles (0°–300° in 60° steps) at 30° elevation, plus zenith and nadir views. The zenith and nadir views capture cavities, inner surfaces, and supporting mechanisms (e.g., chassis) that are typically occluded in standard horizontal scans, enabling reasoning about object affordances such as containability.
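The eight camera positions follow directly from the stated geometry: six azimuths in 60° steps at 30° elevation, plus top-down and bottom-up views. A sketch with an assumed camera radius (the paper does not specify one):

```python
import numpy as np

def spherical_eight_views(radius=2.0):
    """Camera centers for the Spherical 8-View System, on a sphere of
    the given radius around the object origin."""
    views = []
    for az_deg in range(0, 360, 60):                 # 0°, 60°, ..., 300°
        az, el = np.radians(az_deg), np.radians(30.0)
        views.append(radius * np.array([np.cos(el) * np.cos(az),
                                        np.cos(el) * np.sin(az),
                                        np.sin(el)]))
    views.append(np.array([0.0, 0.0, radius]))       # zenith (top-down)
    views.append(np.array([0.0, 0.0, -radius]))      # nadir (bottom-up)
    return np.stack(views)
```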

Geometry-Aware Teacher Agent. Scalable annotation is driven by Qwen2.5-VL-72B-Instruct, which generates hierarchical CoT rationales across three task types: geometric attribute reasoning, spatial relation reasoning, and functionality & affordance reasoning. A deterministic cross-validation protocol rigorously cross-references every spatial assertion against rigid 3D object metadata and multi-view consistency constraints, discarding candidates that fail geometric verification to guarantee that the final corpus is anchored in physical 3D truth.

Point-Reason-Instruct Construction Pipeline
Figure 2. The Data Construction Pipeline of Point-Reason-Instruct. (1) Dual-Stream Preprocessing: objects are sampled into point clouds and rendered into 8 spherical views. (2) Multi-Task Reasoning Generation: the Qwen2.5-VL teacher agent generates hierarchical CoT rationales covering geometric attributes, spatial relations, and functionality. (3) Quality Filtering: rationales are validated against 3D metadata to eliminate hallucinations.

Hierarchical Task Design

The dataset is structured into three cognitive hierarchies to rigorously evaluate reasoning depth, ranging from local perception to abstract deduction:

Level 1 Structural Part Reasoning (~28k samples)

Identification and structural analysis of explicit object components. Queries require recognizing specific parts (e.g., armrests, legs, chassis), counting them, and analyzing geometric integrity. Example: "How many armrests does this chair have, and are they connected to a central axis?"

Level 2 3D Viewpoint Reasoning (~29k samples)

Holistic understanding of 3D structures and spatial perspectives. The model performs mental rotation or infers details of occluded viewpoints (e.g., underside, rear surface) and reasons about relative spatial arrangement. Example: "Describe the geometric structure of the chair's back from the rear view."

Level 3 Functionality and Affordance Reasoning (~29k samples)

Physics-grounded causal reasoning. The model bridges static form and dynamic function, applying principles (gravity, friction, containment) to deduce interactions directly from geometry. Example: "Would this container spill its contents if tilted 45 degrees?"

Strict Object-Level Splitting prevents data leakage: 22,871 objects (79.5%) for training, 2,945 (10.2%) for validation, and 2,944 (10.2%) for testing. Geometric shapes encountered during inference are entirely unseen during training.
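Object-level splitting is straightforward to reproduce: partition the unique object IDs rather than the individual samples, so no geometry appears in more than one split. A sketch using the paper's ratios (the shuffle seed is an assumption):

```python
import random

def object_level_split(object_ids, ratios=(0.795, 0.102), seed=42):
    """Strict object-level split: shuffle unique object IDs and partition
    them into train/val/test, so every sample derived from a given object
    lands in exactly one split and no shape leaks across splits."""
    ids = sorted(set(object_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```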

Experimental Results

We evaluate PointCoT on the Point-Reason-Instruct benchmark and additionally assess zero-shot generalization on Objaverse-LVIS (open-vocabulary captioning) and ScanQA (complex indoor spatial reasoning). Our architecture integrates PointBERT and EVA-CLIP (ViT-g/14) as encoders with Qwen2.5-7B-Instruct as the LLM backbone. Training runs on 8× NVIDIA A100 (80GB) GPUs.

Main Results on Point-Reason-Instruct

| Model | Modality | Backbone (Encoder / LLM) | Overall | Geo. | Spat. | Func. |
|---|---|---|---|---|---|---|
| *General-Purpose 2D VLMs (Zero-shot)* | | | | | | |
| GPT-4V | Img | Proprietary / GPT-4 | 65.4 | 58.2 | 68.5 | 71.2 |
| Qwen2-VL-7B | Img | ViT-L / Qwen2 | 63.8 | 54.5 | 64.2 | 69.5 |
| LLaVA-1.5 | Img | ViT-L / Vicuna-7B | 54.1 | 47.2 | 51.8 | 58.9 |
| *Specialized 3D-LLMs (Fine-tuned)* | | | | | | |
| Chat-3D v2 | PC | ViT-g / Vicuna-7B | 66.1 | 72.4 | 63.2 | 62.8 |
| Point-LLM | PC | PointBERT / Vicuna-7B | 62.4 | 68.1 | 59.2 | 58.5 |
| Point-Bind | PC | PointBERT / Llama2-7B | 58.1 | 65.2 | 55.4 | 52.3 |
| **PointCoT (Ours)** | PC+Img | PointBERT / Qwen2.5-7B | **78.5** | **82.3** | **76.4** | **75.1** |

Accuracy (%) across three cognitive dimensions on Point-Reason-Instruct. Geo. = structural perception; Spat. = spatial positioning/connectivity; Func. = physics-grounded causal reasoning. PointCoT outperforms the strongest fine-tuned baseline by +12.4%.

Architecture Agnosticism & Scalability

| LLM Backbone | Point Encoder | Overall Acc (%) | Reasoning Score (GPT-4, 1–10) |
|---|---|---|---|
| Vicuna-7B | PointBERT | 73.4 | 7.9 |
| Mistral-7B-v0.3 | PointBERT | 76.8 | 8.3 |
| Qwen2.5-7B (Ours) | PointBERT | 78.5 | 8.5 |
| Qwen2.5-7B | PointNeXt | 79.6 | 8.7 |

PointCoT with the weaker Vicuna-7B backbone (73.4%) still significantly outperforms the end-to-end Point-LLM baseline (62.4%), confirming the intrinsic value of explicit rationale generation.

Zero-Shot Generalization & Data Efficiency

| Method | Training Data Scale | ScanQA (BLEU-4) | Objaverse (Acc %) |
|---|---|---|---|
| 3D-LLM | 300k (4.3×) | 24.5 | 49.2 |
| Point-LLM | 660k (9.6×) | 22.4 | 45.1 |
| PointCoT (Ours) | ~69k (1.0×) | 23.4 | 51.8 (+2.6) |

Despite training on merely ~69k samples, PointCoT achieves a leading 51.8% accuracy on Objaverse-LVIS, demonstrating strong data efficiency: mastery of localized geometric primitives transfers to open-vocabulary recognition. ScanQA is evaluated on localized instance crops to align with our object-centric regime.

Automated Evaluation of Rationale Quality

| Method | Correctness | Logic | Grounding | Average |
|---|---|---|---|---|
| Point-LLM (CoT-adapted) | 7.1 | 6.8 | 6.2 | 6.7 |
| GPT-4V | 8.1 | 8.4 | 7.6 | 8.0 |
| **PointCoT (Ours)** | **8.4** | **8.6** | **8.9** | **8.6** |

GPT-4 evaluates 200 sampled rationales conditioned on ground-truth 3D metadata. PointCoT achieves the highest Grounding score (8.9), confirming that its deductive chains are anchored in verifiable 3D spatial evidence.

Ablation Studies

(a) Input Modality Ablation

| Configuration | Acc (%) | Score |
|---|---|---|
| Image Only | 55.2 | 5.1 |
| Point Only | 64.8 | 5.8 |
| Hybrid (Ours) | 78.5 | 8.5 |

Hybrid modality provides +13.7% over point-only, confirming dual-stream synergy.

(b) Reasoning Strategy Ablation

| Strategy | Acc (%) | GHR ↓ |
|---|---|---|
| Direct Mapping | 67.4 | 25.4% |
| Implicit CoT | 71.2 | 18.2% |
| Explicit CoT (Ours) | 78.5 | 5.1% |

GHR = Geometric Hallucination Rate. Explicit CoT reduces hallucinations from 25.4% to 5.1%.

Qualitative Analysis

Qualitative Results
Figure 4. Qualitative comparison on Point-Reason-Instruct. While the baseline relies on implicit semantic priors and suffers from geometric hallucinations, PointCoT follows a Look–Think–Answer paradigm, actively selecting decisive viewpoints and explicitly grounding rationales in local 3D geometric structures.
Case A — Fine-Grained Structural Reasoning
Q: Is the mascot holding anything in its left hand?
✗ Baseline: "The object is a mascot with two arms and two hands."
✓ PointCoT: Look: Scan the palm area in lateral renders and point cloud. Think: Curved finger geometry and a separate cylindrical point cluster indicate an object. Answer: Yes, holding a stick-like object.
Case B — Topological Detection via Zenith View
Q: Are there any visible openings or gaps on the very top of the windmill's roof?
✗ Baseline: "The windmill structure appears standard in side views."
✓ PointCoT: Look: Use the Zenith View to bypass side blade occlusion. Think: Top-down geometry confirms a continuous manifold without any structural voids. Answer: No, it is a fully closed structure.
Case C — Geometric Profile Estimation via Bottom View
Q: Is the monitor's base circular or rectangular?
✗ Baseline: "This is a standard monitor with a central stand."
✓ PointCoT: Look: Examine the stand footprint in Bottom View. Think: The geometric boundary clearly shows four sharp corners, forming a rectangular profile. Answer: The base is rectangular.

BibTeX

@article{zhang2026pointcot,
  author    = {Zhang, Dongxu and Sun, Yiding and Li, Pengcheng and Liu, Yumou and Lin, Hongqiang
               and Xu, Haoran and Mu, Xiaoxuan and Lin, Liang and Yan, Wenbiao and Yang, Ning
               and Fang, Chaowei and Zhao, Juanjuan and Zhu, Jihua and He, Conghui and Tan, Cheng},
  title     = {PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning},
  journal   = {arXiv preprint arXiv:2602.23945},
  year      = {2026},
}