Concerto

Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

1The University of Hong Kong, 2The Chinese University of Hong Kong, 3Harbin Institute of Technology (Shenzhen)

We are excited to present Concerto 🎶, a superior model for 3D representation learning that simulates the human concept learning process for spatial cognition by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding.

Concerto Teaser

Video

We present PCA visualizations of Concerto's inference on point cloud and video data, comparing the raw input (left) with the resulting features (right). By employing joint 2D-3D self-supervised learning, Concerto effectively unlocks the potential of large-scale unlabeled point cloud datasets. Because current feed-forward reconstruction methods can lift videos into point clouds with spatial prior knowledge, Concerto also performs strongly on video-lifted point clouds, paving the way toward lifted spatial intelligence. With oceans of unlabeled video data online, this opens up a wealth of opportunities for Concerto.
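As a rough illustration of how such feature maps can be rendered, the sketch below reduces per-point features to three principal components and uses them as RGB colors. The model call and viewer in the usage comments are placeholders, not part of the released API.

import numpy as np
from sklearn.decomposition import PCA

def pca_colors(features: np.ndarray) -> np.ndarray:
    """Map per-point features of shape (N, C) to RGB colors of shape (N, 3)."""
    projected = PCA(n_components=3).fit_transform(features)        # (N, 3)
    # Rescale each principal component to [0, 1] so it can serve as a color channel.
    mins, maxs = projected.min(axis=0), projected.max(axis=0)
    return (projected - mins) / (maxs - mins + 1e-8)

# Hypothetical usage: `model` and `viewer` stand in for a feature extractor
# and any point cloud visualizer.
# colors = pca_colors(model(points))   # (N, 3) values in [0, 1]
# viewer.show(points, colors)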

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP’s language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.

Beyond Single Modality

  • Start from Human Concept Learning:

Our inspiration toward this target is rooted in how humans learn abstract concepts: through multisensory synergy. Consider the example of an apple (as illustrated on the right). Our understanding of it is formed through repeatedly seeing, touching, and tasting apples, allowing us to internalize its geometry, texture, and semantic meaning in a unified, predictive way (right top). Yet once such a representation is formed, it can be evoked from just a single modality: seeing an image of an apple can vividly bring back its weight and texture (right bottom). This ability to retrieve rich, structured knowledge from partial sensory input underscores the importance of learning modality-agnostic representations that are both unified and predictive.

Concerto Motivation
  • Towards a Superior Representation by Joint Multi-Modal Learning:

Inspired by this principle, we aim to leverage the analogous synergy between self-supervised learning on 2D images and 3D point clouds. We begin with a pilot experiment: fusing self-supervised features from the image model DINOv2 and the point cloud model Sonata, and benchmarking the 2D, 3D, and fused representations via linear probing on ScanNet (detailed implementations can be found in our paper). Notably, this naive combination outperforms both individual modalities, suggesting the presence of complementary information and hinting at an even richer representational space if the emerging synergy is fully captured when the modalities are learned together.

Concerto Pilot
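A minimal sketch of this pilot, assuming the frozen 2D and 3D features have already been extracted and lifted onto the same set of points; the tensor names, shapes, and class count below are illustrative stand-ins, not the paper's exact setup.

import torch
import torch.nn as nn

# Stand-ins for frozen features: DINOv2 patch features lifted onto N points and
# Sonata per-point features, with ScanNet-style semantic labels.
feat_2d = torch.randn(10000, 768)                 # (N, C_2d)
feat_3d = torch.randn(10000, 512)                 # (N, C_3d)
labels = torch.randint(0, 20, (10000,))           # (N,) semantic class ids

fused = torch.cat([feat_2d, feat_3d], dim=-1)     # naive feature concatenation

# Linear probing: train only a single linear layer on top of frozen features.
probe = nn.Linear(fused.shape[-1], 20)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(probe(fused), labels)
    loss.backward()
    optimizer.step()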

Pipeline of Concerto

Concerto Pipeline

Concerto simulates human multisensory synergy by coupling
(a) intra-modal self-distillation on 3D point clouds to progressively refine its internal spatial representations, and
(b) cross-modal joint embedding prediction that aligns point features with corresponding image patch features using camera parameters.
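A simplified sketch of the cross-modal pairing in (b): points are projected into the image with a pinhole camera model, matched to the patch they fall into, and pulled toward that patch's features. The cosine loss, the patch size, and the assumption that point and patch features already share one dimension are illustrative simplifications, not the paper's exact objective.

import torch
import torch.nn.functional as F

def project_points(points, intrinsics, extrinsics):
    """Project (N, 3) world-space points to pixel coordinates.
    intrinsics: (3, 3) camera matrix; extrinsics: (4, 4) world-to-camera transform."""
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    cam = (extrinsics @ homo.T).T[:, :3]                                # (N, 3) camera frame
    uvz = (intrinsics @ cam.T).T                                        # (N, 3)
    return uvz[:, :2] / uvz[:, 2:].clamp(min=1e-6)                      # (N, 2) pixel coords

def cross_modal_loss(point_feat, patch_feat, uv, patch_size=14):
    """Pull each point feature toward the image patch feature it projects into.
    point_feat: (N, C); patch_feat: (H_p, W_p, C) patch grid; uv: (N, 2) pixels."""
    pu = (uv[:, 0] / patch_size).long().clamp(0, patch_feat.shape[1] - 1)
    pv = (uv[:, 1] / patch_size).long().clamp(0, patch_feat.shape[0] - 1)
    target = patch_feat[pv, pu]                                         # (N, C) paired patches
    return 1.0 - F.cosine_similarity(point_feat, target, dim=-1).mean()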

Applications

Language Probing: We demonstrate Concerto's ability to form concepts aligned with human language, paving the way for future exploration of alignment with text-based semantic spaces. With a linear probe, we translate Concerto's representations into CLIP's language space. Below are visualizations of object localization obtained by querying Concerto with specific words.

Concerto Language Application
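A rough sketch of this translation, assuming a learned linear layer that maps frozen Concerto features into CLIP's embedding space; the dimensions, the text embedding, and the threshold below are placeholders rather than the released translator.

import torch
import torch.nn as nn
import torch.nn.functional as F

translator = nn.Linear(512, 512)          # linear translator into CLIP space (weights are placeholders)

point_feat = torch.randn(10000, 512)      # stand-in for frozen Concerto point features (N, C)
text_emb = torch.randn(512)               # stand-in for a CLIP text embedding, e.g. of "chair"

projected = F.normalize(translator(point_feat), dim=-1)
text_emb = F.normalize(text_emb, dim=0)
similarity = projected @ text_emb                     # (N,) relevance of each point to the query word
highlight = similarity > similarity.quantile(0.95)    # mask of the most relevant points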

More Details?

For more details on methodologies, evaluation metrics, and comparison with baselines, please refer to our paper.

Read the Paper

Citation


@inproceedings{zhang2025concerto,
  title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
  author={Yujia Zhang and Xiaoyang Wu and Yixing Lao and Chengyao Wang and Zhuotao Tian and Naiyan Wang and Hengshuang Zhao},
  booktitle={NeurIPS},
  year={2025}
}