Utonia:

Toward One Encoder for All Point Clouds

1The University of Hong Kong, 2The Chinese University of Hong Kong, 3Xiaomi

Utonia is a step toward a one-from-all, one-for-all point cloud encoder. It pretrains a single encoder on diverse point cloud data and reuses it as a reliable backbone for downstream tasks. Built on the premise that sparse point clouds are a geometry-first, compact interface to the physical world, Utonia introduces three designs (Causal Modality Blinding, Perceptual Granularity Rescale, and RoPE for Cross-Domain Spatial Encoding) to turn fragmented 3D observations into a shared representation.

 Utonia Teaser

Video

We visualize pretrained features with PCA projections. Each transition performs a progressive zoom-in, moving from a city-scale scene to a street intersection, an indoor room, and fine-grained objects. Across these domains, Utonia remains consistent and transferable. It handles city-level large-range scans, outdoor LiDAR with strong sensor-pattern artifacts, dense indoor point clouds, and rotation-invariant object-centric geometry, highlighting strong generalization across granularities, sensors, and coordinate conventions.

Abstract

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across heterogeneous domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.

What Prevents Unification?

  • Sensitivity to granularity shifts.

In sparse point processing, grid size sets the metric unit of local neighborhoods. The same operator may span centimeters in one domain but meters in another, so granularity shifts change neighborhood statistics and topology and couple features to domain-specific scale. As shown in the table, settings that work per domain become fragile under joint training, which motivates granularity alignment for joint pretraining. Inspired by an observer's roughly fixed minimal resolution, we rescale each point cloud to a canonical observing granularity, so positional hints and interactions are built on comparable spatial units across domains.

density
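The granularity rescale above can be pictured as a small preprocessing step. The sketch below is a minimal illustration, not the released implementation; the function name, subsampling scheme, and spacing estimator are all assumptions. It estimates a cloud's typical point spacing from a subsample and rescales the cloud so that spacing matches a canonical value:

```python
import numpy as np

def rescale_to_canonical_granularity(points, target_spacing=0.02, sample=1024, seed=0):
    """Rescale a point cloud so its typical point spacing matches a canonical
    value, making local neighborhoods span comparable metric units across
    domains. Hypothetical sketch; not Utonia's actual preprocessing code."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(sample, len(points)), replace=False)
    sub = points[idx]
    # Median nearest-neighbor distance as a proxy for observing granularity.
    d2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    spacing = np.median(np.sqrt(d2.min(axis=1)))
    scale = target_spacing / spacing
    return points * scale, scale
```

After this step, a neighborhood radius expressed in the canonical unit covers roughly the same number of points whether the input was a city scan or a tabletop object.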
  • Bias toward gravity convention.

Many scene-level point clouds with a coarser granularity follow a gravity-aligned convention, making height a physical reference. Accordingly, prior scene-centric training usually avoids strong x/y rotations, since many semantics, such as supporting surfaces, depend on gravity. However, height can act as a domain cue, which hurts transfer to object-centric scans with a fine granularity. In the figure, features from Sonata and Concerto exhibit strong z-correlated patterns, while Utonia weakens this correlation. This suggests treating gravity alignment as a granularity-dependent prior: retaining upright structure for scene-scale scans while encouraging rotation invariance for fine-grained objects.

gravity
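A granularity-dependent prior like this can be implemented as a simple augmentation switch. The sketch below is an assumption about how such a rule might look, not Utonia's actual augmentation code: gravity-aligned scene scans only get yaw (z-axis) rotations, while fine-grained object scans get full random rotations.

```python
import numpy as np

def rotation_for_granularity(points, is_scene, rng=None):
    """Granularity-dependent rotation augmentation (hypothetical sketch):
    z-axis-only rotations for gravity-aligned scene scans, uniform random
    rotations for object-centric scans to encourage rotation invariance."""
    if rng is None:
        rng = np.random.default_rng()
    if is_scene:
        a = rng.uniform(0, 2 * np.pi)
        # Yaw rotation about z keeps the height axis (gravity) intact.
        R = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0,        0.0,       1.0]])
    else:
        # Random rotation via QR decomposition of a Gaussian matrix.
        Q, Rm = np.linalg.qr(rng.normal(size=(3, 3)))
        Q *= np.sign(np.diag(Rm))      # fix column signs for uniformity
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]         # ensure a proper rotation (det = +1)
        R = Q
    return points @ R.T
```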
  • Inconsistent modality availability.

Point clouds across domains expose different auxiliary channels beyond coordinates, such as colors and normals. In naive multi-domain pretraining, the encoder tends to exploit these channels whenever they exist, since they provide strong cues that complement coordinates. However, the resulting dependence is unstable: when these modalities are missing, noisy, or defined differently, representations and performance can degrade. Intuitively, it resembles a person who walks well with vision but stumbles when suddenly blindfolded. This motivates an explicit design for modality availability during pretraining, enabling the model to benefit from optional modalities when present while remaining robust when they are absent.
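The simplest instance of such a design is random modality dropout during pretraining. The sketch below illustrates that generic idea; the paper's Causal Modality Blinding may differ in its schedule and masking details, and the dictionary keys here are assumptions.

```python
import numpy as np

def blind_modalities(feats, p_drop=0.5, rng=None):
    """Randomly zero out optional channels (e.g. color, normal) per sample so
    the encoder cannot over-rely on any auxiliary modality. Generic modality
    dropout; a simplified stand-in for the paper's blinding scheme."""
    if rng is None:
        rng = np.random.default_rng()
    out = {}
    for name, x in feats.items():
        if name == "coord":
            out[name] = x                 # coordinates are always kept
        elif rng.random() < p_drop:
            out[name] = np.zeros_like(x)  # blind this modality
        else:
            out[name] = x
    return out
```

Because the encoder regularly sees coordinate-only inputs, it learns a representation that degrades gracefully rather than abruptly when a modality is absent at inference time.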

Pipeline of Utonia

Utonia Pipeline

Utonia introduces three critical improvements to the point cloud SSL pipeline.

  • Cross-domain data: jointly training on object-centric, indoor, and outdoor point clouds.

  • RoPE-enhanced Point Transformer V3: strengthening spatial encoding and cross-domain transferability via RoPE on granularity-aligned coordinates and domain-prior erasure.

  • Broader evaluation: extending beyond standard perception tasks to spatial reasoning, robotic manipulation, and open-world part segmentation.
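To make the RoPE-on-coordinates idea concrete, here is a minimal sketch that drives rotary embeddings with continuous, granularity-aligned 3D coordinates rather than token indices. The channel split and frequency layout are assumptions for illustration, not Utonia's exact design.

```python
import numpy as np

def rope_3d(x, coords, base=100.0):
    """Rotary position embedding driven by continuous 3D coordinates
    (hypothetical layout): split channels into three axis segments, rotate
    each channel pair by an angle proportional to that axis's coordinate.
    x: (N, C) features with C divisible by 6; coords: (N, 3) positions."""
    N, C = x.shape
    assert C % 6 == 0, "need an even number of channels per spatial axis"
    per_axis = C // 3
    half = per_axis // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    out = np.empty_like(x)
    for a in range(3):
        seg = x[:, a * per_axis:(a + 1) * per_axis]
        theta = coords[:, a:a + 1] * freqs           # (N, half) angles
        cos, sin = np.cos(theta), np.sin(theta)
        x1, x2 = seg[:, :half], seg[:, half:]
        # 2D rotation of each channel pair by its coordinate-scaled angle.
        out[:, a * per_axis:(a + 1) * per_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out
```

The useful property carries over from the 1D case: the inner product between a rotated query and key depends only on the relative offset between their coordinates, so attention sees relative geometry regardless of each domain's coordinate origin.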

Applications

Robotics Manipulation: Utonia can separate objects from supporting surfaces and remain coherent under occlusion and partial observations, providing geometry-aware cues that are useful for downstream grasping and motion planning. The quantitative results are provided in the paper, and the qualitative visualization is shown in the figure.

Utonia Robotics Manipulation Application

Open-World Object Segmentation: We further evaluate Utonia on open-world 3D object segmentation by building on P3SAM. The figure compares Sonata and Utonia under the same promptable segmentation setting. The detailed quantitative segmentation results are provided in our paper. The Sonata-initialized encoder produces features lacking distinct part semantics. In contrast, the Utonia-initialized encoder generates features with clear part-level structures, enabling highly accurate segmentation with well-defined boundaries and consistent semantic meaning.

Utonia Open-world Object Segmentation Application

Spatial Reasoning: We evaluate Sonata, Concerto, and Utonia using the baseline Video-3D LLM on three representative tasks: 3D visual grounding, 3D dense captioning, and 3D question answering. The table shows that Utonia yields consistent gains on grounding and question answering, while remaining competitive on dense captioning.

Utonia Spatial Reasoning Application

More Details?

For more details on methodologies, evaluation metrics, and comparison with baselines, please refer to our paper.

Read the Paper

Citation


@article{zhang2025utonia,
  title={Utonia: Toward One Encoder for All Point Clouds},
  author={Zhang, Yujia and Wu, Xiaoyang and Yang, Yunhan and Fan, Xianzhe and Li, Han and Zhang, Yuechen and Huang, Zehao and Wang, Naiyan and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:},
  year={2026}
}