HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Abstract

We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or motion capture systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time and cost compared to manual labeling. With this system, we captured a video dataset of humans performing various single- and dual-hand manipulation tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordances. This dataset can serve as human demonstrations for research in embodied AI and robot manipulation. Our capture setup and annotation framework will be made available to the community for reconstructing 3D shapes of objects and human hands, as well as tracking their poses in videos.

Cite

Text

Wang et al. "HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-hocap/)

BibTeX

@inproceedings{wang2025neurips-hocap,
  title     = {{HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction}},
  author    = {Wang, Jikai and Zhang, Qifan and Chao, Yu-Wei and Wen, Bowen and Guo, Xiaohu and Xiang, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-hocap/}
}