We are excited to present Concerto 🎶, a model for 3D representation learning that simulates the human concept learning process for spatial cognition by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding.
We present PCA visualizations of Concerto's inference on point cloud and video data, comparing the raw input (left) with the processed representation (right). Through joint 2D-3D self-supervised learning, Concerto unlocks the potential of large-scale unlabeled point cloud datasets. Because current feed-forward reconstruction methods can lift videos into point clouds with spatial priors, Concerto also performs strongly on video-lifted point clouds, paving the way toward lifted spatial intelligence. With the vast amount of unlabeled video data available online, this opens up a wealth of opportunities for Concerto.
We show interactive point cloud visualizations for both point cloud and video data. The raw RGB-colored point cloud is shown on the left, and the PCA visualization of the learned representation is on the right.
(Mouse wheel to zoom in/out, drag to rotate, ctrl + drag to pan)
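For readers who want to reproduce this kind of view, the sketch below shows one common way to turn high-dimensional point features into colors: project them onto their top three PCA components and rescale each component to [0, 1] as RGB. The features_to_rgb helper and the (N, C) feature array are illustrative assumptions, not Concerto's released code.

# Minimal sketch of a PCA feature visualization, assuming `features` is an
# (N, C) array of per-point features produced by the model.
import numpy as np
from sklearn.decomposition import PCA

def features_to_rgb(features: np.ndarray) -> np.ndarray:
    """Project high-dimensional point features onto 3 PCA components and
    rescale each component to [0, 1] so it can be rendered as RGB."""
    components = PCA(n_components=3).fit_transform(features)  # (N, 3)
    lo, hi = components.min(axis=0), components.max(axis=0)
    return (components - lo) / (hi - lo + 1e-8)

# Usage: given xyz coordinates (N, 3) and features (N, C),
# rgb = features_to_rgb(features)  # then feed (xyz, rgb) to any point cloud viewer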
Concerto simulates human multisensory synergy by coupling
(a) intra-modal self-distillation on 3D point clouds to progressively refine its internal spatial representations, and
(b) cross-modal joint embedding prediction that aligns point features with corresponding image patch features using camera parameters (see the sketch below).
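A conceptual sketch of how these two objectives might be coupled into a single training loss is given below, assuming a student/EMA-teacher pair of point cloud encoders, a frozen image encoder, a precomputed point-to-patch index from camera projection, and matched feature dimensions (a projection head would sit here in practice). All names (student, teacher, image_encoder, point_to_patch, lambda_cross) are illustrative, not Concerto's actual API.

# A conceptual sketch of the joint objective under the stated assumptions.
import torch
import torch.nn.functional as F

def joint_loss(student, teacher, image_encoder, view_a, view_b,
               images, point_to_patch, lambda_cross=1.0):
    # (a) Intra-modal self-distillation between two augmented, point-wise
    # aligned views of the same point cloud; the EMA teacher provides targets.
    s_feat = student(view_a)                         # (N, C) student point features
    with torch.no_grad():
        t_feat = teacher(view_b)                     # (N, C) EMA teacher targets
    loss_distill = 1 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()

    # (b) Cross-modal joint embedding prediction: each point predicts the
    # feature of the image patch it projects into (point_to_patch comes from
    # the camera intrinsics/extrinsics).
    with torch.no_grad():
        patch_feat = image_encoder(images)           # (P, C) frozen patch features
    loss_cross = 1 - F.cosine_similarity(s_feat, patch_feat[point_to_patch], dim=-1).mean()

    return loss_distill + lambda_cross * loss_cross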
Language Probing: We demonstrate Concerto's ability to form concepts that align with human language, paving the way for future exploration of alignment with text-based semantic spaces. Using linear probing, we translate Concerto's representations into the language space. Below are visualizations of object localization obtained by querying Concerto with specific words.
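As a rough illustration of what such a probe could look like, the sketch below maps point features into a text embedding space with a single linear layer and highlights points whose probed features are close to a query word's embedding. The dimensions, probe, and locate are hypothetical placeholders, not the method's released code.

# Minimal sketch of a linear probe into a language embedding space.
import torch
import torch.nn.functional as F

POINT_DIM, TEXT_DIM = 512, 768                  # illustrative feature sizes
probe = torch.nn.Linear(POINT_DIM, TEXT_DIM)    # linear probe, trained offline

def locate(point_features: torch.Tensor, word_embedding: torch.Tensor,
           threshold: float = 0.3) -> torch.Tensor:
    """Return a boolean mask over points whose probed features match the word."""
    probed = F.normalize(probe(point_features), dim=-1)   # (N, TEXT_DIM)
    query = F.normalize(word_embedding, dim=-1)           # (TEXT_DIM,)
    return probed @ query > threshold                     # (N,) similarity mask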
@inproceedings{zhang2025concerto,
title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
author={Yujia Zhang and Xiaoyang Wu and Yixing Lao and Chengyao Wang and Zhuotao Tian and Naiyan Wang and Hengshuang Zhao},
booktitle={NeurIPS},
year={2025}
}