X-Pose: Detecting Any Keypoints

Abstract

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Trained on UniKPT, X-Pose effectively aligns text-to-keypoint and image-to-keypoint due to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK, and 7.0 AP compared to state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods in each respective fair setting. More importantly, the in-the-wild test demonstrates X-Pose’s strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications.

Cite

Text

Yang et al. "X-Pose: Detecting Any Keypoints." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72952-2_15

Markdown

[Yang et al. "X-Pose: Detecting Any Keypoints." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yang2024eccv-xpose/) doi:10.1007/978-3-031-72952-2_15

BibTeX

@inproceedings{yang2024eccv-xpose,
  title     = {{X-Pose: Detecting Any Keypoints}},
  author    = {Yang, Jie and Zeng, Ailing and Zhang, Ruimao and Zhang, Lei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72952-2_15},
  url       = {https://mlanthology.org/eccv/2024/yang2024eccv-xpose/}
}