Utonia is a step toward a one-from-all and one-for-all point cloud encoder. It pretrains a single encoder on diverse point cloud data and reuses it as a reliable backbone for downstream tasks. Built on the premise that sparse point clouds are a geometry-first, compact interface to the physical world, Utonia introduces three designs, Causal Modality Blinding, Perceptual Granularity Rescale, and RoPE for Cross-Domain Spatial Encoding, to turn fragmented 3D observations into a shared representation.
We visualize pretrained features with PCA projections. Each transition performs a progressive zoom-in, moving from a city-scale scene to a street intersection, an indoor room, and fine-grained objects. Across these domains, Utonia remains consistent and transferable. It handles city-level large-range scans, outdoor LiDAR with strong sensor-pattern artifacts, dense indoor point clouds, and rotation-invariant object-centric geometry, highlighting strong generalization across granularities, sensors, and coordinate conventions.
We show interactive point cloud visualizations. The raw RGB-colored point cloud is on the left; the PCA-projected feature visualization is on the right.
(Mouse wheel to zoom in/out, drag to rotate, Ctrl + drag to pan, drag the vertical split line to compare the raw colors and the PCA features)
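The PCA feature coloring used in these visualizations can be sketched as follows. This is a generic minimal sketch, not Utonia's exact pipeline: it projects per-point encoder features onto their top three principal components and rescales them to RGB; the random features stand in for real encoder outputs.

```python
import numpy as np

def pca_to_rgb(features: np.ndarray) -> np.ndarray:
    """Project (N, D) point features onto their top 3 principal
    components and min-max rescale each component to [0, 1] for RGB."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Top principal directions via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                     # (N, 3) projection
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)          # per-channel rescale

feats = np.random.randn(1000, 256)  # stand-in for pretrained encoder features
rgb = pca_to_rgb(feats)
print(rgb.shape)  # (1000, 3)
```

Points with similar features receive similar colors, which is why coherent parts and objects appear as uniform color regions in the right-hand view.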
For more examples, use our GitHub inference code to generate your own visualizations! Check it out
Utonia introduces three critical improvements to the point cloud SSL pipeline. Cross-domain data: jointly pretraining on object-centric, indoor, and outdoor point clouds. RoPE-enhanced Point Transformer V3: strengthening spatial encoding and cross-domain transferability via RoPE on granularity-aligned coordinates and domain-prior erasure. Broader evaluation: extending beyond standard perception tasks to spatial reasoning, robotic manipulation, and open-world part segmentation.
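To make the RoPE component concrete, here is a minimal sketch of rotary position embedding applied to continuous 3D coordinates. The channel split across the x/y/z axes and the frequency schedule are our illustrative assumptions, not Utonia's exact design; the point is that each axis rotates a slice of the feature channels by coordinate-dependent angles, making attention scores depend on relative position.

```python
import numpy as np

def rope_3d(x: np.ndarray, coords: np.ndarray, base: float = 100.0) -> np.ndarray:
    """Apply rotary embedding to features x (N, C) using coords (N, 3).
    C must be divisible by 6: each axis gets C // 3 channels (C // 6 pairs).
    Frequency schedule and axis split are illustrative assumptions."""
    n, c = x.shape
    per_axis = c // 3
    out = np.empty_like(x)
    for a in range(3):  # one channel slice per spatial axis
        seg = x[:, a * per_axis:(a + 1) * per_axis]
        half = per_axis // 2
        freqs = base ** (-np.arange(half) / half)     # (half,) decaying freqs
        theta = coords[:, a:a + 1] * freqs[None, :]   # (N, half) angles
        cos, sin = np.cos(theta), np.sin(theta)
        x1, x2 = seg[:, :half], seg[:, half:]
        # 2D rotation of each (x1, x2) channel pair by angle theta
        out[:, a * per_axis:(a + 1) * per_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out
```

Because each pair is rotated rather than translated, the embedding preserves feature norms, and dot products between rotated queries and keys depend only on coordinate differences, which is what makes RoPE attractive for cross-domain spatial encoding.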
Robotic Manipulation: Utonia separates objects from supporting surfaces and remains coherent under occlusion and partial observation, providing geometry-aware cues useful for downstream grasping and motion planning. Quantitative results are provided in the paper; the qualitative visualization is shown in the figure.
Open-World Object Segmentation: We further evaluate Utonia on open-world 3D object segmentation by building on P3SAM. The figure compares Sonata and Utonia under the same promptable segmentation setting. The detailed quantitative segmentation results are provided in our paper. The Sonata-initialized encoder produces features lacking distinct part semantics. In contrast, the Utonia-initialized encoder generates features with clear part-level structures, enabling highly accurate segmentation with well-defined boundaries and consistent semantic meaning.
Spatial Reasoning: We evaluate Sonata, Concerto, and Utonia using the baseline Video-3D LLM on three representative tasks: 3D visual grounding, 3D dense captioning, and 3D question answering. The table shows that Utonia yields consistent gains on grounding and question answering, while remaining competitive on dense captioning.
@article{zhang2025utonia,
title={Utonia: Toward One Encoder for All Point Clouds},
author={Zhang, Yujia and Wu, Xiaoyang and Yang, Yunhan and Fan, Xianzhe and Li, Han and Zhang, Yuechen and Huang, Zehao and Wang, Naiyan and Zhao, Hengshuang},
journal={arXiv preprint arXiv:},
year={2026}
}