Cross-View and Cross-Pose Completion for 3D Human Understanding
Abstract
Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, relying on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks and obtain state-of-the-art performance, for instance, when fine-tuning for model-based and model-free human mesh recovery.
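The pair-based completion objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch size, the masking ratio, the mean-squared-error loss, and the helper names (`patchify`, `make_completion_sample`, `masked_reconstruction_loss`) are all assumptions chosen for illustration. It only shows how one training sample might be assembled from an image pair, whether cross-view or cross-pose.

```python
import numpy as np


def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def make_completion_sample(img1, img2, patch=16, mask_ratio=0.75, rng=None):
    """Build one sample: img1 is partially masked, img2 stays fully visible.

    patch and mask_ratio are hypothetical defaults, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    p1 = patchify(img1, patch)
    p2 = patchify(img2, patch)
    n = p1.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    return {
        "visible_patches": p1[visible_idx],  # visible part of the first image
        "visible_idx": visible_idx,
        "masked_idx": masked_idx,
        "reference_patches": p2,             # second image of the pair
        "targets": p1[masked_idx],           # pixels the model must reconstruct
    }


def masked_reconstruction_loss(pred, targets):
    """Mean squared error over masked patches only (an assumed loss)."""
    return float(np.mean((pred - targets) ** 2))
```

In such a setup, a model would take `visible_patches` and `reference_patches` as input and predict the content at `masked_idx`; the loss is evaluated only on the masked patches of the first image.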
Cite
Text
Armando et al. "Cross-View and Cross-Pose Completion for 3D Human Understanding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00150
Markdown
[Armando et al. "Cross-View and Cross-Pose Completion for 3D Human Understanding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/armando2024cvpr-crossview/) doi:10.1109/CVPR52733.2024.00150
BibTeX
@inproceedings{armando2024cvpr-crossview,
title = {{Cross-View and Cross-Pose Completion for 3D Human Understanding}},
author = {Armando, Matthieu and Galaaoui, Salma and Baradel, Fabien and Lucas, Thomas and Leroy, Vincent and Brégier, Romain and Weinzaepfel, Philippe and Rogez, Grégory},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {1512-1523},
doi = {10.1109/CVPR52733.2024.00150},
url = {https://mlanthology.org/cvpr/2024/armando2024cvpr-crossview/}
}