ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

Abstract

Vision Transformers (ViTs) perform remarkably well at 3D pose estimation, yet challenges remain. First, the popular ViT architecture for pose estimation operates on single images and lacks temporal information. Second, its predictions often fail to maintain pixel alignment with the original images. To address these issues, we propose ExtPose, a systematic framework for 3D pose estimation. ExtPose extends the image ViT to challenging scenarios and to the video setting by taking in additional 2D pose evidence and capturing temporal information in a fully attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, reducing PA-MPJPE to 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND relative to other ViT-based methods.
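
For readers unfamiliar with the reported metric: PA-MPJPE (Procrustes-aligned mean per-joint position error) is commonly computed by first aligning the predicted joints to the ground truth with a similarity transform (scale, rotation, translation) and then averaging the per-joint Euclidean distances. The following is a minimal NumPy sketch of that common definition, not the authors' evaluation code; the function name `pa_mpjpe` and the `(J, 3)` input layout are assumptions made for illustration.

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error (illustrative sketch).

    pred, gt: arrays of shape (J, 3) holding J joint positions in the same
    unit (e.g. millimetres). The (J, 3) layout is an assumption of this sketch.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Remove translation by centring both joint sets.
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt

    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal isotropic scale for the rotated prediction.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    # Apply the similarity transform, then average per-joint distances.
    aligned = scale * p @ R.T + mu_gt
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes global scale, rotation, and translation, a prediction that differs from the ground truth only by such a transform scores approximately zero; the metric therefore measures errors in articulated pose rather than in global placement.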

Cite

Text

Chen et al. "ExtPose: Robust and Coherent Pose Estimation by Extending ViTs." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Chen et al. "ExtPose: Robust and Coherent Pose Estimation by Extending ViTs." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/chen2025icml-extpose/)

BibTeX

@inproceedings{chen2025icml-extpose,
  title     = {{ExtPose: Robust and Coherent Pose Estimation by Extending ViTs}},
  author    = {Chen, Rongyu and Zhuo, Li’An and Yang, Linlin and Wang, Qi and Bo, Liefeng and Zhang, Bang and Yao, Angela},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {9933--9946},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/chen2025icml-extpose/}
}