HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding
Abstract
Understanding comprehensive assembly knowledge from videos is critical for the future ultra-intelligent industry. To enable technological breakthroughs, we present HA-ViD – the first human assembly video dataset that features representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulates action annotations into subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view, multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps needed for comprehending knowledge of assembly progress, process efficiency, task collaboration, skill parameters and human intention. Details of HA-ViD are available at: https://iai-hrc.github.io/ha-vid.
Cite
Text
Zheng et al. "HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding." Neural Information Processing Systems, 2023.

Markdown
[Zheng et al. "HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zheng2023neurips-havid/)

BibTeX
@inproceedings{zheng2023neurips-havid,
title = {{HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding}},
author = {Zheng, Hao and Lee, Regina and Lu, Yuqian},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/zheng2023neurips-havid/}
}