Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark
Abstract
We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture a single subject or are limited to indoor-only scenarios, which restricts the generalization of computer vision algorithms to real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild, with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing tennis, fencing, and volleyball. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset.
Cite
Text
Khirodkar et al. "Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01814
Markdown
[Khirodkar et al. "Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/khirodkar2023iccv-egohumans/) doi:10.1109/ICCV51070.2023.01814
BibTeX
@inproceedings{khirodkar2023iccv-egohumans,
title = {{Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark}},
author = {Khirodkar, Rawal and Bansal, Aayush and Ma, Lingni and Newcombe, Richard and Vo, Minh and Kitani, Kris},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {19807--19819},
doi = {10.1109/ICCV51070.2023.01814},
url = {https://mlanthology.org/iccv/2023/khirodkar2023iccv-egohumans/}
}