Vision Foundation Model Enables Generalizable Object Pose Estimation
Abstract
Object pose estimation plays a crucial role in robotic manipulation; however, its practical applicability still suffers from limited generalizability. This paper addresses the challenge of generalizable object pose estimation, particularly focusing on category-level object pose estimation for unseen object categories. Current methods either require impractical instance-level training or are confined to predefined categories, limiting their applicability. We propose VFM-6D, a novel framework that explores harnessing existing vision and language models to decompose object pose estimation into two stages: category-level object viewpoint estimation and object coordinate map estimation. Based on this two-stage framework, we introduce a 2D-to-3D feature lifting module and a shape-matching module, both of which leverage pre-trained vision foundation models to improve object representation and matching accuracy. VFM-6D is trained on cost-effective synthetic data and exhibits superior generalization capabilities. It can be applied to both instance-level unseen object pose estimation and category-level object pose estimation for novel categories. Evaluations on benchmark datasets demonstrate the effectiveness and versatility of VFM-6D in various real-world scenarios.
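To make the two-stage decomposition concrete, below is a minimal, hypothetical sketch of such a pipeline: stage one matches frozen foundation-model features of the query image against a set of reference viewpoints, and stage two solves the pose by rigidly aligning a predicted object coordinate map with back-projected camera-frame points. All names are illustrative, and the foundation-model extractor is mocked with a fixed random projection so the script runs without model weights; this is not the authors' implementation.

```python
# Hypothetical sketch of a two-stage pose pipeline in the spirit of VFM-6D.
# The "foundation model" is a stand-in (fixed random projection), and the
# stage-2 solver uses Kabsch alignment on synthetic correspondences.
import numpy as np

rng = np.random.default_rng(0)

def foundation_features(image: np.ndarray, dim: int = 64) -> np.ndarray:
    """Stand-in for a frozen vision foundation model: a fixed random
    projection of the flattened image, L2-normalized."""
    proj = np.random.default_rng(42).standard_normal((image.size, dim))
    feat = image.reshape(-1) @ proj
    return feat / (np.linalg.norm(feat) + 1e-8)

def estimate_viewpoint(query: np.ndarray, reference_views: list) -> int:
    """Stage 1: pick the reference viewpoint whose features best match
    the query (cosine similarity over normalized features)."""
    q = foundation_features(query)
    sims = [q @ foundation_features(v) for v in reference_views]
    return int(np.argmax(sims))

def solve_pose(obj_pts: np.ndarray, cam_pts: np.ndarray):
    """Stage 2: rigid alignment (Kabsch) between predicted object-frame
    coordinates and camera-frame points, yielding rotation R and
    translation t with cam_pts ~= obj_pts @ R.T + t."""
    mu_o, mu_c = obj_pts.mean(axis=0), cam_pts.mean(axis=0)
    H = (obj_pts - mu_o).T @ (cam_pts - mu_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_o
    return R, t

if __name__ == "__main__":
    # Stage 1 demo: the query is one of the reference views, so the
    # matched index should be recovered exactly.
    refs = [rng.standard_normal((16, 16)) for _ in range(8)]
    print("matched viewpoint:", estimate_viewpoint(refs[3], refs))

    # Stage 2 demo: recover a known rigid transform from synthetic
    # correspondences standing in for a predicted coordinate map.
    obj = rng.standard_normal((100, 3))
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1  # ensure a proper rotation (det = +1)
    t_gt = np.array([0.1, -0.2, 0.5])
    cam = obj @ Q.T + t_gt
    R, t = solve_pose(obj, cam)
    print("rotation error:", np.linalg.norm(R - Q))
    print("translation error:", np.linalg.norm(t - t_gt))
```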
Cite
Text
Chen et al. "Vision Foundation Model Enables Generalizable Object Pose Estimation." Neural Information Processing Systems, 2024. doi:10.52202/079017-0630
Markdown
[Chen et al. "Vision Foundation Model Enables Generalizable Object Pose Estimation." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/chen2024neurips-vision/) doi:10.52202/079017-0630
BibTeX
@inproceedings{chen2024neurips-vision,
title = {{Vision Foundation Model Enables Generalizable Object Pose Estimation}},
author = {Chen, Kai and Ma, Yiyao and Lin, Xingyu and James, Stephen and Zhou, Jianshu and Liu, Yun-Hui and Abbeel, Pieter and Dou, Qi},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0630},
url = {https://mlanthology.org/neurips/2024/chen2024neurips-vision/}
}