Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Abstract

Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich multi-modal object information (e.g., images, point clouds, text) into embeddings and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding that fuses the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves performance competitive with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.
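The pipeline described above (ground modalities in a voxel grid, fuse into a compact embedding, decode to Gaussian-splat parameters) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`voxelize_points`, `fuse`, `decode_to_gaussians`), the random linear projections standing in for the learned fusion and decoder networks, and all dimensions are hypothetical choices for exposition.

```python
import numpy as np

def voxelize_points(points, grid_size=8):
    """Geometric grounding: bin 3D points into a fixed voxel grid
    of normalized occupancy (a stand-in for richer voxel features)."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    idx = ((points - mins) / (maxs - mins + 1e-8) * (grid_size - 1e-6)).astype(int)
    grid = np.zeros((grid_size, grid_size, grid_size))
    for i, j, k in idx:
        grid[i, j, k] += 1.0
    return grid / len(points)

def fuse(voxel_grid, attr_embedding, dim=64, seed=0):
    """Fuse flattened voxel features with an object-attribute embedding
    (e.g., from text or images) into one compact descriptor.
    A fixed random projection stands in for the learned fusion network."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([voxel_grid.ravel(), attr_embedding])
    W = rng.standard_normal((dim, x.size)) / np.sqrt(x.size)
    return np.tanh(W @ x)

def decode_to_gaussians(descriptor, n_gaussians=16, seed=1):
    """Decode the descriptor into per-Gaussian parameters
    (mean xyz, isotropic scale, opacity) -- a placeholder for a
    learned 3D Gaussian Splatting decoder."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_gaussians * 5, descriptor.size)) / np.sqrt(descriptor.size)
    params = (W @ descriptor).reshape(n_gaussians, 5)
    means = params[:, :3]
    scales = np.exp(params[:, 3])            # positive scales
    opacities = 1.0 / (1.0 + np.exp(-params[:, 4]))  # in (0, 1)
    return means, scales, opacities
```

The storage advantage claimed in the abstract follows from this design: only the fused descriptor (here, 64 floats) needs to be stored per object, rather than the source images or point clouds.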

Cite

Text

Di Lorenzo et al. "Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations." Advances in Neural Information Processing Systems, 2025.

Markdown

[Di Lorenzo et al. "Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/lorenzo2025neurips-objectx/)

BibTeX

@inproceedings{lorenzo2025neurips-objectx,
  title     = {{Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations}},
  author    = {Di Lorenzo, Gaia and Tombari, Federico and Pollefeys, Marc and Barath, Daniel},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/lorenzo2025neurips-objectx/}
}