POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Abstract

Despite significant progress in six degrees-of-freedom (6DoF) object pose estimation, existing methods have limited applicability in real-world scenarios involving embodied agents and downstream 3D vision tasks. These limitations mainly stem from the need for 3D models, closed-category detection, and a large number of densely annotated support views. To mitigate these limitations, we propose a general paradigm for object pose estimation, called Promptable Object Pose Estimation (POPE). POPE enables zero-shot 6DoF object pose estimation for any target object in any scene, using only a single reference as the support view. To achieve this, POPE leverages the power of a pre-trained large-scale 2D foundation model and employs a framework that combines hierarchical feature representation with 3D geometry principles. Moreover, it estimates the relative camera pose between object prompts and the target object in new views, enabling both two-view and multi-view 6DoF pose estimation tasks. Comprehensive experimental results demonstrate that POPE performs robustly in zero-shot settings, reducing the averaged Median Pose Error by 52.38% and 50.47% on the LINEMOD [22] and OnePose [54] datasets, respectively. We also conduct more challenging tests on casually captured images (see Figure 1), which further demonstrate the robustness of POPE.
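The abstract reports a Median Pose Error but does not define it. As a minimal sketch, assuming the error is the geodesic angle between predicted and ground-truth relative rotations (a common convention, not confirmed by this page), it could be computed as follows. The function names `rotation_error_deg`, `median_rotation_error_deg`, and `rot_z` are hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def median_rotation_error_deg(preds, gts):
    """Median of per-pair rotation errors (assumed metric, see lead-in)."""
    return median(rotation_error_deg(p, g) for p, g in zip(preds, gts))


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Toy data only: three predictions that are off by 10, 20, and 60 degrees.
gts = [rot_z(0), rot_z(0), rot_z(0)]
preds = [rot_z(10), rot_z(20), rot_z(60)]
median_err = median_rotation_error_deg(preds, gts)
```

The median is less sensitive than the mean to a few grossly wrong poses, which is one reason it is often reported for pose benchmarks; the translation component, if evaluated, would need a separate error term.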

Cite

Text

Fan et al. "POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00773

Markdown

[Fan et al. "POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/fan2024cvprw-pope/) doi:10.1109/CVPRW63382.2024.00773

BibTeX

@inproceedings{fan2024cvprw-pope,
  title     = {{POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference}},
  author    = {Fan, Zhiwen and Pan, Panwang and Wang, Peihao and Jiang, Yifan and Xu, Dejia and Wang, Zhangyang},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7771-7781},
  doi       = {10.1109/CVPRW63382.2024.00773},
  url       = {https://mlanthology.org/cvprw/2024/fan2024cvprw-pope/}
}