To democratize research on explicit 3D CoT reasoning, we construct
Point-Reason-Instruct—the first large-scale dataset combining 3D point
clouds with explicit CoT annotations. Unlike previous benchmarks that provide simple
input-output pairs, each sample in our dataset provides a triplet of
⟨Point Cloud, Multi-view Images, CoT Rationale⟩,
enabling models to learn how to reason, not just what to answer.
The dataset is publicly available on
HuggingFace.
Data Sourcing. We curate objects from the
Objaverse-LVIS subset using
a topology-aware filtering protocol that prioritizes objects with complex geometries and
articulated parts while filtering out overly simplistic shapes. For each object, point clouds
are generated by sampling N = 8,192 points via Farthest Point Sampling (FPS).
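The FPS step can be sketched in a few lines of NumPy. This is a generic greedy implementation for illustration, not the authors' released pipeline, and the demo uses a smaller cloud than the paper's N = 8,192 to keep it quick:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from the already-selected set."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(n)
    # Minimum squared distance from every point to the selected set so far.
    dists = np.full(n, np.inf)
    for i in range(1, n_samples):
        last = points[selected[i - 1]]
        dists = np.minimum(dists, np.sum((points - last) ** 2, axis=1))
        selected[i] = int(np.argmax(dists))
    return points[selected]

# Demo on a random cloud (the paper samples N = 8,192 points per object).
cloud = np.random.rand(5000, 3)
sampled = farthest_point_sampling(cloud, 512)
print(sampled.shape)  # (512, 3)
```

FPS is preferred over uniform random sampling here because it spreads points evenly over the surface, preserving thin parts and fine structure at a fixed budget.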
Holistic Multi-View Rendering. We design a Spherical 8-View System
for holistic coverage: six equidistant azimuth views (0° to 300°, 60° apart) at 30°
elevation, plus zenith and nadir views. The two polar views capture cavities, inner surfaces,
and supporting mechanisms (e.g., a chassis) that are typically occluded in standard horizontal
scans, enabling reasoning about object affordances such as containability.
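The eight camera placements can be enumerated directly from the view specification. The y-up axis convention and camera radius below are assumptions for illustration, since the rendering frame is not specified here:

```python
import math

def spherical_camera(azimuth_deg: float, elevation_deg: float, radius: float = 2.0):
    """Camera position on a sphere centered on the object (y-up convention assumed)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.sin(el)
    z = radius * math.cos(el) * math.sin(az)
    return (x, y, z)

# Six equidistant azimuths at 30° elevation, plus zenith (+90°) and nadir (-90°).
views = [spherical_camera(az, 30.0) for az in range(0, 360, 60)]
views += [spherical_camera(0.0, 90.0), spherical_camera(0.0, -90.0)]
print(len(views))  # 8
```

Each position would then be paired with a look-at transform toward the object centroid before rendering.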
Geometry-Aware Teacher Agent. Scalable annotation is driven by
Qwen2.5-VL-72B-Instruct, which generates hierarchical CoT rationales across three task
types: geometric attribute reasoning, spatial relation reasoning, and functionality &
affordance reasoning. A deterministic cross-validation protocol checks every spatial
assertion against the object's 3D metadata and multi-view consistency constraints;
candidates that fail geometric verification are discarded, so the final corpus remains
anchored in physical 3D ground truth.
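The verification logic is not detailed above, so the following is only a minimal sketch of one such deterministic check, under assumed schemas: the helper name `verify_part_count` and the `parts` metadata dictionary are hypothetical, standing in for the protocol's comparison of a claimed spatial assertion against object metadata:

```python
def verify_part_count(assertion: dict, metadata: dict) -> bool:
    """Hypothetical verifier: keep a CoT candidate only if its claimed part
    count agrees exactly with the object's ground-truth metadata."""
    part = assertion["part"]
    claimed = assertion["count"]
    return metadata.get("parts", {}).get(part) == claimed

# Illustrative metadata for a chair object.
chair_meta = {"parts": {"armrest": 2, "leg": 4}}
ok = verify_part_count({"part": "armrest", "count": 2}, chair_meta)
bad = verify_part_count({"part": "armrest", "count": 3}, chair_meta)
print(ok, bad)  # True False
```

Because the check is an exact comparison against metadata rather than a model judgment, the filtering decision is reproducible, which is what makes the protocol deterministic.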
The dataset is structured into three cognitive hierarchies to rigorously evaluate reasoning
depth, ranging from local perception to abstract deduction:
Level 1: Structural Part Reasoning (~28k samples)
Identification and structural analysis of explicit object components. Queries require
recognizing specific parts (e.g., armrests, legs, chassis), counting them, and analyzing
geometric integrity.
Example: "How many armrests does this chair have, and are they connected to a central axis?"
Level 2: 3D Viewpoint Reasoning (~29k samples)
Holistic understanding of 3D structures and spatial perspectives. The model performs
mental rotation or infers details of occluded viewpoints (e.g., underside, rear surface)
and reasons about relative spatial arrangement.
Example: "Describe the geometric structure of the chair's back from the rear view."
Level 3: Functionality and Affordance Reasoning (~29k samples)
Physics-grounded causal reasoning. The model bridges static form and dynamic function,
applying principles (gravity, friction, containment) to deduce interactions directly
from geometry.
Example: "Would this container spill its contents if tilted 45 degrees?"
Strict Object-Level Splitting prevents data leakage: 22,871 objects (79.5%)
for training, 2,945 (10.2%) for validation, and 2,944 (10.2%) for testing. Every object
geometry encountered at inference time is therefore entirely unseen during training.
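An object-level split can be implemented by hashing the object ID rather than assigning individual samples, so that all samples derived from one object land in the same split. The 80/10/10 thresholds below only approximate the 79.5/10.2/10.2 ratios above, and the helper name `split_of` is illustrative:

```python
import hashlib

def split_of(object_id: str) -> str:
    """Deterministic object-level split assignment: every sample of an object
    maps to the same split, so no geometry leaks from train into test."""
    h = int(hashlib.sha256(object_id.encode("utf-8")).hexdigest(), 16) % 100
    if h < 80:
        return "train"
    if h < 90:
        return "val"
    return "test"

# All samples of one object (e.g., its three task-type questions) share a split.
obj = "objaverse_000123"
print(split_of(obj) == split_of(obj))  # True
```

Hashing the ID instead of drawing random numbers also makes the split reproducible across runs without storing an explicit assignment table.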