Coordinate Transformer: Achieving Single-Stage Multi-Person Mesh Recovery from Videos

Abstract

Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.
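To make the idea of attending over pixel tokens that carry explicit spatio-temporal coordinates more concrete, here is a toy sketch of coordinate-preserving self-attention. This is a generic, hypothetical illustration (simple single-head attention with coordinates concatenated to each pixel token), not the paper's actual Coordinate-Aware Attention implementation; all function and variable names are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coordinate_aware_attention_toy(features):
    """Toy sketch: self-attention over per-pixel tokens that keep explicit
    (t, y, x) coordinate channels, so attention can depend on pixel-level
    spatio-temporal position rather than coarse patch position.

    features: array of shape (T, H, W, C) -- a small video feature map.
    Hypothetical illustration; not the paper's implementation.
    """
    T, H, W, C = features.shape
    # Normalized (t, y, x) coordinate channels for every pixel.
    t, y, x = np.meshgrid(
        np.linspace(0.0, 1.0, T),
        np.linspace(0.0, 1.0, H),
        np.linspace(0.0, 1.0, W),
        indexing="ij",
    )
    coords = np.stack([t, y, x], axis=-1)             # (T, H, W, 3)
    tokens = np.concatenate([features, coords], -1)   # append coords to each token
    tokens = tokens.reshape(T * H * W, C + 3)
    # Single-head self-attention with identity projections (toy simplification).
    attn = softmax(tokens @ tokens.T / np.sqrt(C + 3), axis=-1)
    out = attn @ tokens
    return out.reshape(T, H, W, C + 3)
```

Note how every pixel remains an individual token with its own coordinates, in contrast to patch-wise tokenization, where a whole patch collapses into one token and pixel-level position inside the patch is lost.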

Cite

Text

Li et al. "Coordinate Transformer: Achieving Single-Stage Multi-Person Mesh Recovery from Videos." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00803

Markdown

[Li et al. "Coordinate Transformer: Achieving Single-Stage Multi-Person Mesh Recovery from Videos." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/li2023iccv-coordinate/) doi:10.1109/ICCV51070.2023.00803

BibTeX

@inproceedings{li2023iccv-coordinate,
  title     = {{Coordinate Transformer: Achieving Single-Stage Multi-Person Mesh Recovery from Videos}},
  author    = {Li, Haoyuan and Dong, Haoye and Jia, Hanchao and Huang, Dong and Kampffmeyer, Michael C. and Lin, Liang and Liang, Xiaodan},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {8744--8753},
  doi       = {10.1109/ICCV51070.2023.00803},
  url       = {https://mlanthology.org/iccv/2023/li2023iccv-coordinate/}
}