One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

Abstract

Estimating the 6D pose of arbitrary objects from a single reference image is a critical yet challenging task in robotics, especially given the long-tail distribution of real-world object instances. While category-level and model-based approaches have achieved notable progress, they still generalize poorly to unseen objects in one-shot settings. In this work, we propose a novel pipeline for fast and accurate one-shot 6D pose and scale estimation. Leveraging recent advances in single-view 3D generation, we first build high-fidelity textured meshes without requiring known object poses. To resolve scale ambiguity, we introduce a coarse-to-fine alignment module that estimates both object size and initial pose by matching 2D-3D features against depth information. We then generate a diversified set of plausible 3D models using text-guided generative augmentation and render them with Blender to synthesize large-scale, domain-randomized training data for pose estimation. This synthetic data bridges the domain gap and enables robust fine-tuning of pose estimators. Our method achieves state-of-the-art results on several 6D pose benchmarks, and we further validate its effectiveness on a newly collected in-the-wild dataset. Finally, we integrate our system with a dexterous hand, demonstrating its robustness in real-world robotic grasping tasks. All code, data, and models will be released to foster future research.
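The scale-ambiguity resolution step lends itself to a compact illustration. Below is a minimal sketch, assuming a unit-scale mesh from single-view 3D generation, matched 2D-3D features, camera intrinsics, and a metric depth map: PnP on the unit-scale model yields a rotation and an up-to-scale translation, and the ratio between observed and predicted point depths recovers the metric scale. All function and variable names here are illustrative, not the authors' released code.

Python

import cv2
import numpy as np

def estimate_scale_and_pose(pts3d, pts2d, depth_map, K):
    """pts3d: (N, 3) points on the unit-scale mesh; pts2d: (N, 2) matched pixels;
    depth_map: (H, W) metric depth in meters; K: (3, 3) camera intrinsics."""
    # Coarse stage: PnP on the unit-scale model recovers the rotation and a
    # translation expressed in model units (i.e., correct only up to scale).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)
    idx = inliers.ravel()

    # Depth of each inlier point as predicted by the up-to-scale pose.
    z_pred = (pts3d[idx] @ R.T + tvec.ravel())[:, 2]

    # Observed metric depth at the corresponding pixels.
    uv = np.round(pts2d[idx]).astype(int)
    z_obs = depth_map[uv[:, 1], uv[:, 0]]
    valid = (z_obs > 0) & (z_pred > 0)

    # If the real object is the unit mesh scaled by s, then z_obs ~ s * z_pred,
    # so a robust median of the per-point ratios recovers the metric scale.
    s = float(np.median(z_obs[valid] / z_pred[valid]))
    return s, R, s * tvec.ravel()  # metric scale, rotation, metric translation

In this sketch, scaling both the mesh and the PnP translation by s gives a coarse metric pose hypothesis, which a finer alignment stage would then refine.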

Cite

Text

Geng et al. "One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation." Proceedings of The 9th Conference on Robot Learning, 2025.

Markdown

[Geng et al. "One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation." Proceedings of The 9th Conference on Robot Learning, 2025.](https://mlanthology.org/corl/2025/geng2025corl-one/)

BibTeX

@inproceedings{geng2025corl-one,
  title     = {{One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation}},
  author    = {Geng, Zheng and Wang, Nan and Xu, Shaocong and Ye, Chongjie and Li, Bohan and Chen, Zhaoxi and Peng, Sida and Zhao, Hao},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  year      = {2025},
  pages     = {168--197},
  volume    = {305},
  url       = {https://mlanthology.org/corl/2025/geng2025corl-one/}
}