Exploiting Temporal Context for 3D Human Pose Estimation in the Wild

Abstract

We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms that operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEva datasets.
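
To make the sequence-level constraints concrete, the following is a minimal sketch of the kind of energy a temporal bundle adjustment might minimize. It is not the authors' implementation. The orthographic camera, the bone-length proxy for constant body shape, and all names (`project_orthographic`, `sequence_energy`, `w_smooth`, `w_shape`) are assumptions made for illustration; the abstract only states that body shape stays fixed and 3D positions vary slowly.

```python
import numpy as np


def project_orthographic(joints_3d):
    """Map (T, J, 3) joints to (T, J, 2) by dropping depth.

    Assumption: a toy orthographic camera, chosen only for simplicity.
    """
    return joints_3d[..., :2]


def sequence_energy(joints_3d, keypoints_2d, bones, w_smooth=1.0, w_shape=1.0):
    """Hypothetical sequence-level energy over T frames and J joints.

    joints_3d:    (T, J, 3) candidate 3D joint positions
    keypoints_2d: (T, J, 2) observed 2D keypoints
    bones:        (B, 2) integer array of (parent, child) joint indices
    """
    # Per-frame data term: 3D joints should reproject onto the 2D keypoints.
    reproj = np.sum((project_orthographic(joints_3d) - keypoints_2d) ** 2)

    # Temporal term: 3D positions should vary slowly between adjacent frames.
    smooth = np.sum((joints_3d[1:] - joints_3d[:-1]) ** 2)

    # Shape term: bone lengths (a stand-in for body shape) should be
    # constant over the sequence, so penalize deviation from the mean.
    parents, children = bones[:, 0], bones[:, 1]
    lengths = np.linalg.norm(
        joints_3d[:, children] - joints_3d[:, parents], axis=-1
    )  # shape (T, B)
    shape = np.sum((lengths - lengths.mean(axis=0, keepdims=True)) ** 2)

    return reproj + w_smooth * smooth + w_shape * shape


# Toy data: a static 4-joint chain observed over 5 frames.
rng = np.random.default_rng(0)
pose = rng.normal(size=(4, 3))
static = np.tile(pose, (5, 1, 1))
keypoints = project_orthographic(static)
bones = np.array([[0, 1], [1, 2], [2, 3]])

# Perturb depth only: every frame still matches its 2D keypoints exactly,
# so a single-frame reprojection term cannot tell the two sequences apart.
jittered = static.copy()
jittered[..., 2] += rng.normal(scale=0.5, size=(5, 4))

energy_static = sequence_energy(static, keypoints, bones)
energy_jittered = sequence_energy(jittered, keypoints, bones)
```

In this toy setup the depth-jittered sequence has zero reprojection error in every frame, yet its total energy is higher than that of the static sequence because the temporal term penalizes the frame-to-frame jumps. This illustrates, under the stated assumptions, how sequence-level terms can resolve ambiguities that single-frame fitting leaves open.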

Cite

Text

Arnab et al. "Exploiting Temporal Context for 3D Human Pose Estimation in the Wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00351

Markdown

[Arnab et al. "Exploiting Temporal Context for 3D Human Pose Estimation in the Wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/arnab2019cvpr-exploiting/) doi:10.1109/CVPR.2019.00351

BibTeX

@inproceedings{arnab2019cvpr-exploiting,
  title     = {{Exploiting Temporal Context for 3D Human Pose Estimation in the Wild}},
  author    = {Arnab, Anurag and Doersch, Carl and Zisserman, Andrew},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00351},
  url       = {https://mlanthology.org/cvpr/2019/arnab2019cvpr-exploiting/}
}