Estimating Body and Hand Motion in an Ego-Sensed World

Abstract

We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture a device wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints can reduce world-frame errors in single-frame estimates by 40%.
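The invariance idea in the abstract can be made concrete with a small example. The sketch below is illustrative only, not the paper's actual conditioning parameterization: the helper name relative_head_motion and the toy pose construction are assumptions. It shows one simple way to make a head-motion representation invariant to rigid world-frame transformations (spatial invariance) and to where the window starts in the recording (temporal invariance): condition on frame-to-frame relative SE(3) transforms rather than on absolute SLAM poses.

import numpy as np

def relative_head_motion(T_world_head: np.ndarray) -> np.ndarray:
    """Map a window of absolute head poses to frame-to-frame relative transforms.

    T_world_head: (N, 4, 4) SE(3) head poses, e.g. from egocentric SLAM.
    Returns: (N - 1, 4, 4) transforms T_t^{-1} @ T_{t+1}.
    """
    # Batched inverse and matmul. Each output depends only on pose
    # *differences*, so rigidly moving the whole trajectory, or shifting
    # the window start, leaves the representation unchanged.
    return np.linalg.inv(T_world_head[:-1]) @ T_world_head[1:]

def _pose(yaw: float, t) -> np.ndarray:
    """Build a toy SE(3) pose: rotation about z by `yaw`, translation `t`."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

if __name__ == "__main__":
    # A short head trajectory at roughly standing height.
    poses = np.stack([_pose(0.1 * i, [0.3 * i, 0.0, 1.6]) for i in range(5)])
    # A global rigid transform of the scene frame.
    G = _pose(1.0, [5.0, -2.0, 0.0])
    # (G T_t)^{-1} (G T_{t+1}) = T_t^{-1} T_{t+1}, so the features match.
    assert np.allclose(relative_head_motion(poses),
                       relative_head_motion(G @ poses))
    print("relative features are invariant to the global transform")

Because any global transform G cancels in T_t^{-1} G^{-1} G T_{t+1}, the conditioning signal no longer depends on where in the scene, or when in the recording, the window was taken; the paper derives its head motion parameterization from criteria of this kind.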

Cite

Text

Yi et al. "Estimating Body and Hand Motion in an Ego-Sensed World." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00663

Markdown

[Yi et al. "Estimating Body and Hand Motion in an Ego-Sensed World." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/yi2025cvpr-estimating/) doi:10.1109/CVPR52734.2025.00663

BibTeX

@inproceedings{yi2025cvpr-estimating,
  title     = {{Estimating Body and Hand Motion in an Ego-Sensed World}},
  author    = {Yi, Brent and Ye, Vickie and Zheng, Maya and Li, Yunqi and Müller, Lea and Pavlakos, Georgios and Ma, Yi and Malik, Jitendra and Kanazawa, Angjoo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {7072--7084},
  doi       = {10.1109/CVPR52734.2025.00663},
  url       = {https://mlanthology.org/cvpr/2025/yi2025cvpr-estimating/}
}