Self-Supervised Human Depth Estimation from Monocular Videos
Abstract
Previous methods for estimating detailed human depth often require supervised training with 'ground truth' depth data. This paper presents a self-supervised method that can be trained on YouTube videos without known depth, which simplifies training data collection and improves the generalization of the learned network. The self-supervised learning is achieved by minimizing a photo-consistency loss, evaluated between a video frame and its neighboring frames warped according to the estimated depth and the 3D non-rigid motion of the human body. To recover this non-rigid motion, we first estimate a rough SMPL model at each video frame and compute the non-rigid body motion accordingly, which enables self-supervised learning of the shape details. Experiments demonstrate that our method generalizes better and performs much better on data in the wild.
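The photo-consistency idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera, grayscale images stored as nested lists, a per-pixel 3D displacement field that is simply given (in the paper this motion is derived from the estimated SMPL models), and nearest-neighbor sampling in place of a differentiable warp. The names `backproject`, `project`, and `photo_consistency_loss` are hypothetical.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with its depth to a 3D point in camera space."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def photo_consistency_loss(target, source, depth, motion, intrinsics):
    """Mean absolute intensity difference between a frame and a warped neighbor.

    target, source: H x W grayscale images (nested lists of floats).
    depth:          H x W estimated depth for the target frame.
    motion:         H x W grid of (dx, dy, dz) displacements that move each
                    target-frame 3D point to its position in the source frame
                    (assumed given here).
    intrinsics:     (fx, fy, cx, cy) of an assumed pinhole camera.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            x, y, z = backproject(u, v, depth[v][u], fx, fy, cx, cy)
            dx, dy, dz = motion[v][u]
            moved = (x + dx, y + dy, z + dz)
            if moved[2] <= 0:
                continue  # point ends up behind the camera
            us, vs = project(moved, fx, fy, cx, cy)
            # Nearest-neighbor lookup; a trainable version would use
            # bilinear sampling so gradients can reach the depth estimate.
            ui, vi = math.floor(us + 0.5), math.floor(vs + 0.5)
            if 0 <= ui < width and 0 <= vi < height:
                total += abs(target[v][u] - source[vi][ui])
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy scene: the source frame is the target shifted one pixel to the right.
    target = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    source = [[0.0, 1.0, 2.0, 3.0], [0.0, 5.0, 6.0, 7.0]]
    intrinsics = (2.0, 2.0, 0.0, 0.0)
    motion = [[(1.0, 0.0, 0.0)] * 4 for _ in range(2)]
    for d in (2.0, 1.0):
        depth = [[d] * 4 for _ in range(2)]
        print(d, photo_consistency_loss(target, source, depth, motion, intrinsics))
```

In this toy scene, a depth of 2.0 combined with the given motion reproduces the one-pixel shift exactly, so the loss vanishes; a wrong depth produces a different shift and a larger loss. That dependence of the warp on depth is what lets the loss act as a training signal without ground-truth depth.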
Cite
Text
Tan et al. "Self-Supervised Human Depth Estimation from Monocular Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.00073
Markdown
[Tan et al. "Self-Supervised Human Depth Estimation from Monocular Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/tan2020cvpr-selfsupervised/) doi:10.1109/CVPR42600.2020.00073
BibTeX
@inproceedings{tan2020cvpr-selfsupervised,
title = {{Self-Supervised Human Depth Estimation from Monocular Videos}},
author = {Tan, Feitong and Zhu, Hao and Cui, Zhaopeng and Zhu, Siyu and Pollefeys, Marc and Tan, Ping},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2020},
doi = {10.1109/CVPR42600.2020.00073},
url = {https://mlanthology.org/cvpr/2020/tan2020cvpr-selfsupervised/}
}