Sapiens: Foundation for Human Vision Models

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks – 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are easy to adapt to individual tasks by simply fine-tuning foundation models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts performance across a diverse set of human-centric tasks. The resulting models generalize remarkably well to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability – model performance across tasks improves significantly as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing complex baselines across various human-centric benchmarks. Specifically, we improve over the prior state of the art by 7.6 mAP on Humans-5K (pose), 17.1 mIoU on Humans-2K (part-seg), 22.4% relative RMSE on Hi4D (depth), and 53.5% relative angular error on THuman2 (normal).
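The depth and normal results above are stated as relative error reductions, which the abstract does not define. The sketch below assumes the common definitions: RMSE over per-pixel depth values, mean angle between predicted and ground-truth normal vectors, and a relative improvement computed as the percentage drop in error from a baseline. The function names and the sample numbers are hypothetical and are not taken from the paper or its code release.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of depths."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle, in degrees, between corresponding 3D normal vectors."""
    if len(pred_normals) != len(gt_normals) or len(pred_normals) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
        cosine = max(-1.0, min(1.0, dot / norms))
        total += math.degrees(math.acos(cosine))
    return total / len(pred_normals)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Made-up values for illustration only, not results from the paper.
    baseline = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
    improved = rmse([1.0, 2.0, 4.0], [1.0, 2.0, 5.0])
    print(f"relative RMSE improvement: {relative_improvement(baseline, improved):.1f}%")
    print(f"angular error: {mean_angular_error_deg([(0, 0, 1)], [(0, 1, 0)]):.1f} deg")
```

Under this reading, a "22.4% relative RMSE" gain means the new RMSE is 22.4% lower than the prior method's RMSE, while the pose and segmentation gains (mAP, mIoU) are absolute point differences.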

Cite

Text

Khirodkar et al. "Sapiens: Foundation for Human Vision Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73235-5_12

Markdown

[Khirodkar et al. "Sapiens: Foundation for Human Vision Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/khirodkar2024eccv-sapiens/) doi:10.1007/978-3-031-73235-5_12

BibTeX

@inproceedings{khirodkar2024eccv-sapiens,
  title     = {{Sapiens: Foundation for Human Vision Models}},
  author    = {Khirodkar, Rawal and Bagautdinov, Timur and Martinez, Julieta and Su, Zhaoen and James, Austin T and Selednik, Peter and Anderson, Stuart and Saito, Shunsuke},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73235-5_12},
  url       = {https://mlanthology.org/eccv/2024/khirodkar2024eccv-sapiens/}
}