PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

1Xi'an Jiaotong University, 2Tsinghua University, 3Shanghai Jiao Tong University, 4Zhejiang University, 5Nanyang Technological University, 6Harbin Institute of Technology, Shenzhen, 7Institute of Automation, CASIA, 8Taiyuan University of Technology, 9Shanghai AI Laboratory
*Equal contribution
PointCoT: Explicit Look-Think-Answer Paradigm vs. Geometric Hallucination

PointCoT introduces an explicit Look–Think–Answer paradigm that empowers Multimodal Large Language Models with geometry-grounded Chain-of-Thought reasoning over 3D point clouds, significantly reducing geometric hallucinations and enabling interpretable 3D understanding.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models, yet they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations: confidently generating plausible responses that are not grounded in precise structural details.

To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate a Look, Think, then Answer paradigm, in which the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising ~86k instruction-tuning samples with hierarchical CoT annotations. Leveraging a dual-stream multi-modal architecture, our method fuses semantic appearance cues from multi-view renders with geometric structure from raw point clouds. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.

Motivation

Pioneering works like PointLLM and 3D-LLM have successfully projected 3D point cloud features into the LLM input space, enabling basic 3D Question Answering. Despite these advancements, a critical limitation remains overlooked: most existing 3D-LLMs treat geometric reasoning as a black-box mapping process, training models to map input point clouds directly to final answers end-to-end while bypassing explicit reasoning steps.

When facing complex spatial tasks—such as judging whether a chair with a missing leg is stable—these models suffer from what we term Geometric Hallucination (illustrated above). A conventional model might correctly identify the semantic category but fail to ground its judgment in fine-grained geometric details (e.g., the missing leg), leading to plausible but factually incorrect conclusions.

This problem is rooted in two key challenges:

  • Data Scarcity. Existing 3D benchmarks offer simple pair-wise annotations and lack the explicit rationale supervision needed to train an interpretable reasoning process.
  • Modality Gap. Unlike rich 2D images, point clouds are sparse and lack semantic texture, making pure geometric reasoning difficult, while rendered images suffer from depth ambiguity.
👁 Look
Perceive fine-grained geometry. The model actively selects decisive viewpoints and scans specific structural regions of the 3D point cloud and multi-view renders.
💭 Think
Derive an explicit rationale grounded in spatial evidence. Intermediate reasoning tokens are anchored back to invariant geometric features via a contrastive InfoNCE loss.
✅ Answer
Deduce the final conclusion conditioned on both the geometric manifold and the verified reasoning chain, yielding interpretable and hallucination-resistant predictions.

Method

PointCoT operationalizes the Look–Think–Answer paradigm as a three-stage pipeline built around a tri-modal contextualization architecture:

Stage 1 — Look: Tri-Modal Contextualization

The alignment function Ealign integrates three modalities into a unified geometric-semantic manifold z. A Point Encoder (PointBERT) extracts high-dimensional geometric representations and absolute 3D spatial centroids from the raw point cloud. Concurrently, a Vision Encoder (EVA-CLIP ViT-g/14) embeds 8 spherical multi-view images into a visual semantic space. A novel Geometry-Guided Cross-Modal Attention (GCMA) module then synchronizes geometric and visual tokens by modulating their interaction with physical camera projection priors and a learnable spatial bandwidth constraint, augmented by Fourier-based relative spatial embeddings to capture high-frequency morphological variations. A dynamic occlusion-aware gate further mitigates 3D-to-2D projection artifacts. The resulting fused tokens are concatenated with tokenized text instructions to form the tri-modal manifold z.
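The geometry-guided modulation at the heart of GCMA can be sketched as cross-attention whose logits carry a Gaussian spatial prior. The following is a minimal NumPy illustration, not the released implementation: the back-projected view-token positions (`view_xyz`), the single-head formulation, and the fixed `bandwidth` are stand-ins for the paper's camera projection priors, learnable bandwidth constraint, occlusion gate, and Fourier embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gcma_attention(point_tokens, view_tokens, point_xyz, view_xyz, bandwidth=0.5):
    """Simplified geometry-guided cross-modal attention.

    Point tokens act as queries over multi-view tokens (keys/values);
    attention logits are biased by a Gaussian prior on the 3D distance
    between each point centroid and each view token's assumed position,
    so spatially distant views are exponentially down-weighted.
    """
    d = point_tokens.shape[-1]
    logits = point_tokens @ view_tokens.T / np.sqrt(d)            # (P, V)
    dist2 = ((point_xyz[:, None, :] - view_xyz[None, :, :]) ** 2).sum(-1)
    prior = -dist2 / (2.0 * bandwidth ** 2)                       # spatial bias
    attn = softmax(logits + prior, axis=-1)                       # (P, V)
    return attn @ view_tokens                                     # fused tokens
```

The Gaussian bias is the illustrative analogue of the learnable spatial bandwidth: shrinking `bandwidth` makes each point attend only to views that are geometrically close to it.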

Stage 2 — Think: Explicit Geometry-Grounded CoT Generation

The LLM autoregressively generates an explicit rationale chain R = {rt} conditioned on the manifold z. To prevent spatial hallucinations, a geometric attention anchor is enforced: at each decoding step t, the hidden state ht is regularized via a contrastive InfoNCE loss (Lanchor) that maximizes mutual information between the evolving reasoning state and the invariant geometric tokens. This theoretical grounding ensures that every generated reasoning token is conditioned on physical 3D truths rather than spurious 2D visual priors.
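The anchor objective can be sketched as standard InfoNCE over cosine similarities. One assumption for illustration: decoding step t is paired with geometric token t as its positive; the paper states only that mutual information between the reasoning state and the geometric tokens is maximized.

```python
import numpy as np

def info_nce_anchor(hidden, geo_tokens, temperature=0.07):
    """Contrastive anchor loss L_anchor (sketch): each reasoning hidden
    state is pulled toward its paired geometric token (diagonal) and
    pushed away from the remaining tokens (off-diagonal negatives)."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    g = geo_tokens / np.linalg.norm(geo_tokens, axis=-1, keepdims=True)
    sim = h @ g.T / temperature                      # (T, G) scaled cosines
    m = sim.max(axis=-1, keepdims=True)              # stable log-softmax
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(-1, keepdims=True)))
    return -np.mean(np.diag(log_prob))               # positives on diagonal
```

When the hidden states already match their geometric tokens the loss is near zero; any drift toward unanchored states raises it, which is the self-correcting behavior described above.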

Stage 3 — Answer: Progressive Dual-Stage Optimization

Training follows a two-stage curriculum. In Stage I (Reasoning Initialization), the model is trained solely on rationale generation with the answer prediction gradient truncated, forcing the acquisition of explicit geometric reasoning and preventing shortcut learning. In Stage II (Causal Deduction Tuning), the full joint objective is optimized using teacher forcing with the ground-truth reasoning trajectory as context. The geometric anchor loss Lanchor acts as a self-correcting mechanism throughout, bounding the autoregressive search space to ensure robust deductive resolution.
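At the loss level, the curriculum reduces to a switch over which terms contribute gradient. The weighting `lambda_anchor` is an assumed coefficient for illustration; the paper does not publish its value.

```python
def curriculum_loss(rationale_nll, answer_nll, anchor_loss,
                    stage, lambda_anchor=0.1):
    """Progressive dual-stage objective (sketch).

    Stage I (Reasoning Initialization): the answer-prediction gradient is
    truncated, so only rationale generation and the geometric anchor are
    optimized. Stage II (Causal Deduction Tuning): the full joint
    objective is optimized with teacher-forced rationales as context.
    """
    if stage == 1:
        return rationale_nll + lambda_anchor * anchor_loss
    return rationale_nll + answer_nll + lambda_anchor * anchor_loss
```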

PointCoT Architecture
Figure 3. The overall architecture of PointCoT. In the Look Stage, a dual-stream encoder (PointBERT + EVA-CLIP) extracts geometric and visual features, fused into a Tri-Modal Manifold z via GCMA. During the Think Stage, the LLM autoregressively generates an explicit rationale R, while hidden states are grounded to geometric tokens via an InfoNCE loss Lanchor. In the Answer Stage, the final answer is deduced conditionally on both z and R through progressive dual-stage optimization.

Point-Reason-Instruct Dataset

To democratize research on explicit 3D CoT reasoning, we construct Point-Reason-Instruct—the first large-scale dataset combining 3D point clouds with explicit CoT annotations. Unlike previous benchmarks that provide simple input-output pairs, each sample in our dataset provides a triplet of ⟨Point Cloud, Multi-view Images, CoT Rationale⟩, enabling models to learn how to reason, not just what to answer. The dataset is publicly available on HuggingFace.

Dataset statistics:

  • 86,280 total instruction-tuning samples
  • 28,760 unique 3D objects (Strict Object-Level Split)
  • 3 cognitive reasoning levels
  • 8 spherical views per object (6 azimuth + zenith + nadir)

Construction Pipeline

Data Sourcing. We curate objects from the Objaverse-LVIS subset using a topology-aware filtering protocol that prioritizes objects with complex geometries and articulated parts while filtering out overly simplistic shapes. For each object, point clouds are generated by sampling N = 8,192 points via Farthest Point Sampling (FPS).
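FPS here is the standard greedy algorithm: repeatedly add the point farthest from everything already selected, which yields uniform geometric coverage. A minimal NumPy sketch (an O(N·M) loop; production pipelines typically use a CUDA kernel):

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest point sampling over an (N, 3) array.

    Starts from a random point, then at each step picks the point whose
    distance to the nearest already-chosen point is largest.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]
    min_d = np.full(n, np.inf)               # distance to nearest chosen point
    for _ in range(n_samples - 1):
        d = ((points - points[chosen[-1]]) ** 2).sum(-1)
        min_d = np.minimum(min_d, d)
        chosen.append(int(min_d.argmax()))
    return points[chosen]
```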

Holistic Multi-View Rendering. We design a Spherical 8-View System for holistic coverage: six equidistant azimuth angles (0°–300° in 60° steps) at 30° elevation, plus zenith and nadir views. The zenith and nadir views capture cavities, inner surfaces, and supporting mechanisms (e.g., chassis) that are typically occluded in standard horizontal scans, enabling reasoning about object affordances such as containability.
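The eight camera positions follow directly from the stated geometry: six azimuths in 60° steps at 30° elevation, plus top-down and bottom-up views. A sketch with an assumed camera radius (the paper does not specify one):

```python
import numpy as np

def spherical_eight_views(radius=2.0):
    """Camera centers for the Spherical 8-View System, on a sphere of
    the given radius around the object origin."""
    views = []
    for az_deg in range(0, 360, 60):                 # 0°, 60°, ..., 300°
        az, el = np.radians(az_deg), np.radians(30.0)
        views.append(radius * np.array([np.cos(el) * np.cos(az),
                                        np.cos(el) * np.sin(az),
                                        np.sin(el)]))
    views.append(np.array([0.0, 0.0, radius]))       # zenith (top-down)
    views.append(np.array([0.0, 0.0, -radius]))      # nadir (bottom-up)
    return np.stack(views)
```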

Geometry-Aware Teacher Agent. Scalable annotation is driven by Qwen2.5-VL-72B-Instruct, which generates hierarchical CoT rationales across three task types: geometric attribute reasoning, spatial relation reasoning, and functionality & affordance reasoning. A deterministic cross-validation protocol rigorously cross-references every spatial assertion against rigid 3D object metadata and multi-view consistency constraints, discarding candidates that fail geometric verification to guarantee that the final corpus is anchored in physical 3D truth.

Point-Reason-Instruct Construction Pipeline
Figure 2. The Data Construction Pipeline of Point-Reason-Instruct. (1) Dual-Stream Preprocessing: objects are sampled into point clouds and rendered into 8 spherical views. (2) Multi-Task Reasoning Generation: the Qwen2.5-VL teacher agent generates hierarchical CoT rationales covering geometric attributes, spatial relations, and functionality. (3) Quality Filtering: rationales are validated against 3D metadata to eliminate hallucinations.

Hierarchical Task Design

The dataset is structured into three cognitive hierarchies to rigorously evaluate reasoning depth, ranging from local perception to abstract deduction:

Level 1 Structural Part Reasoning (~28k samples)

Identification and structural analysis of explicit object components. Queries require recognizing specific parts (e.g., armrests, legs, chassis), counting them, and analyzing geometric integrity. Example: "How many armrests does this chair have, and are they connected to a central axis?"

Level 2 3D Viewpoint Reasoning (~29k samples)

Holistic understanding of 3D structures and spatial perspectives. The model performs mental rotation or infers details of occluded viewpoints (e.g., underside, rear surface) and reasons about relative spatial arrangement. Example: "Describe the geometric structure of the chair's back from the rear view."

Level 3 Functionality and Affordance Reasoning (~29k samples)

Physics-grounded causal reasoning. The model bridges static form and dynamic function, applying principles (gravity, friction, containment) to deduce interactions directly from geometry. Example: "Would this container spill its contents if tilted 45 degrees?"

Strict Object-Level Splitting prevents data leakage: 22,871 objects (79.5%) for training, 2,945 (10.2%) for validation, and 2,944 (10.2%) for testing. Geometric shapes encountered during inference are entirely unseen during training.
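Object-level splitting is straightforward to reproduce: partition the unique object IDs rather than the individual samples, so no geometry appears in more than one split. A sketch using the paper's ratios (the shuffle seed is an assumption):

```python
import random

def object_level_split(object_ids, ratios=(0.795, 0.102), seed=42):
    """Strict object-level split: shuffle unique object IDs and partition
    them into train/val/test, so every sample derived from a given object
    lands in exactly one split and no shape leaks across splits."""
    ids = sorted(set(object_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```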

Experimental Results

We evaluate PointCoT on the Point-Reason-Instruct benchmark and additionally assess zero-shot generalization on Objaverse-LVIS (open-vocabulary captioning) and ScanQA (complex indoor spatial reasoning). Our architecture integrates PointBERT and EVA-CLIP (ViT-g/14) as encoders with Qwen2.5-7B-Instruct as the LLM backbone. Training runs on 8× NVIDIA A100 (80GB) GPUs.

Main Results on Point-Reason-Instruct

| Model | Modality | Backbone (Encoder / LLM) | Overall | Geo. | Spat. | Func. |
|---|---|---|---|---|---|---|
| *General-Purpose 2D VLMs (Zero-shot)* | | | | | | |
| GPT-4V | Img | Proprietary / GPT-4 | 65.4 | 58.2 | 68.5 | 71.2 |
| Qwen2-VL-7B | Img | ViT-L / Qwen2 | 63.8 | 54.5 | 64.2 | 69.5 |
| LLaVA-1.5 | Img | ViT-L / Vicuna-7B | 54.1 | 47.2 | 51.8 | 58.9 |
| *Specialized 3D-LLMs (Fine-tuned)* | | | | | | |
| Chat-3D v2 | PC | ViT-g / Vicuna-7B | 66.1 | 72.4 | 63.2 | 62.8 |
| Point-LLM | PC | PointBERT / Vicuna-7B | 62.4 | 68.1 | 59.2 | 58.5 |
| Point-Bind | PC | PointBERT / Llama2-7B | 58.1 | 65.2 | 55.4 | 52.3 |
| **PointCoT (Ours)** | PC+Img | PointBERT / Qwen2.5-7B | **78.5** | **82.3** | **76.4** | **75.1** |

Accuracy (%) across three cognitive dimensions on Point-Reason-Instruct. Geo. = structural perception; Spat. = spatial positioning/connectivity; Func. = physics-grounded causal reasoning. PointCoT outperforms the strongest fine-tuned baseline by +12.4%.

Architecture Agnosticism & Scalability

| LLM Backbone | Point Encoder | Overall Acc (%) | Reasoning Score (GPT-4, 1–10) |
|---|---|---|---|
| Vicuna-7B | PointBERT | 73.4 | 7.9 |
| Mistral-7B-v0.3 | PointBERT | 76.8 | 8.3 |
| Qwen2.5-7B (Ours) | PointBERT | 78.5 | 8.5 |
| Qwen2.5-7B | PointNeXt | 79.6 | 8.7 |

PointCoT with the weaker Vicuna-7B backbone (73.4%) still significantly outperforms the end-to-end Point-LLM baseline (62.4%), confirming the intrinsic value of explicit rationale generation.

Zero-Shot Generalization & Data Efficiency

| Method | Training Data Scale | ScanQA (BLEU-4) | Objaverse (Acc %) |
|---|---|---|---|
| 3D-LLM | 300k (4.3×) | 24.5 | 49.2 |
| Point-LLM | 660k (9.6×) | 22.4 | 45.1 |
| PointCoT (Ours) | ~69k (1.0×) | 23.4 | 51.8 (+2.6) |

Despite training on merely ~69k samples, PointCoT achieves a leading 51.8% accuracy on Objaverse-LVIS, demonstrating strong data efficiency: mastery of localized geometric primitives transfers to open-vocabulary recognition. ScanQA is evaluated on localized instance crops to align with our object-centric regime.

Automated Evaluation of Rationale Quality

| Method | Correctness | Logic | Grounding | Average |
|---|---|---|---|---|
| Point-LLM (CoT-adapted) | 7.1 | 6.8 | 6.2 | 6.7 |
| GPT-4V | 8.1 | 8.4 | 7.6 | 8.0 |
| **PointCoT (Ours)** | **8.4** | **8.6** | **8.9** | **8.6** |

GPT-4 evaluates 200 sampled rationales conditioned on ground-truth 3D metadata. PointCoT achieves the highest Grounding score (8.9), confirming that its deductive chains are anchored in verifiable 3D spatial evidence.

Ablation Studies

(a) Input Modality Ablation

| Configuration | Acc (%) | Score |
|---|---|---|
| Image Only | 55.2 | 5.1 |
| Point Only | 64.8 | 5.8 |
| Hybrid (Ours) | 78.5 | 8.5 |

Hybrid modality provides +13.7% over point-only, confirming dual-stream synergy.

(b) Reasoning Strategy Ablation

| Strategy | Acc (%) | GHR ↓ |
|---|---|---|
| Direct Mapping | 67.4 | 25.4% |
| Implicit CoT | 71.2 | 18.2% |
| Explicit CoT (Ours) | 78.5 | 5.1% |

GHR = Geometric Hallucination Rate. Explicit CoT reduces hallucinations from 25.4% to 5.1%.

Qualitative Analysis

Qualitative Results
Figure 4. Qualitative comparison on Point-Reason-Instruct. While the baseline relies on implicit semantic priors and suffers from geometric hallucinations, PointCoT follows a Look–Think–Answer paradigm, actively selecting decisive viewpoints and explicitly grounding rationales in local 3D geometric structures.
Case A — Fine-Grained Structural Reasoning
Q: Is the mascot holding anything in its left hand?
✗ Baseline: "The object is a mascot with two arms and two hands."
✓ PointCoT: Look: Scan the palm area in lateral renders and point cloud. Think: Curved finger geometry and a separate cylindrical point cluster indicate an object. Answer: Yes, holding a stick-like object.
Case B — Topological Detection via Zenith View
Q: Are there any visible openings or gaps on the very top of the windmill's roof?
✗ Baseline: "The windmill structure appears standard in side views."
✓ PointCoT: Look: Use the Zenith View to bypass side blade occlusion. Think: Top-down geometry confirms a continuous manifold without any structural voids. Answer: No, it is a fully closed structure.
Case C — Geometric Profile Estimation via Bottom View
Q: Is the monitor's base circular or rectangular?
✗ Baseline: "This is a standard monitor with a central stand."
✓ PointCoT: Look: Examine the stand footprint in Bottom View. Think: The geometric boundary clearly shows four sharp corners, forming a rectangular profile. Answer: The base is rectangular.

BibTeX

@article{zhang2026pointcot,
  author    = {Zhang, Dongxu and Sun, Yiding and Li, Pengcheng and Liu, Yumou and Lin, Hongqiang
               and Xu, Haoran and Mu, Xiaoxuan and Lin, Liang and Yan, Wenbiao and Yang, Ning
               and Fang, Chaowei and Zhao, Juanjuan and Zhu, Jihua and He, Conghui and Tan, Cheng},
  title     = {PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning},
  journal   = {arXiv preprint arXiv:2602.23945},
  year      = {2026},
}