SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation

Zhang, Wanruo; Liu, Mengyuan; Liu, Hong; Li, Wenhao

doi:10.1609/AAAI.V39I10.33101

AAAI 2025 pp. 10148-10156

Abstract

Transformer-based methods have recently been applied to multi-view 3D human pose estimation, lifting 2D poses to 3D by aggregating spatial-temporal information about human joints. However, previous approaches can neither model the inter-frame correspondence of each view's joints individually nor directly consider the interactions among all views at each time step, so multi-view associations are learned insufficiently. To address this issue, we propose a Spatial-View-Temporal transformer (SVTformer) that decouples spatial, view, and temporal information in sequential order for correlation learning and models the dependencies among them in a local-to-global manner. SVTformer consists of an attended Spatial-View-Temporal (SVT) patch embedding, which attentively captures local features of the input poses, and stacked SVT encoders, which progressively extract global spatial-view-temporal dependencies. Specifically, the SVT encoders apply three sequential reconstructions to the attended features: view decoupling to learn temporal-enhanced spatial correlation, temporal decoupling to learn spatial-enhanced view correlation, and another view decoupling to learn spatial-enhanced temporal relationships. This decoupling-coupling-decoupling multi-view scheme enables us to alternately model inter-joint spatial relationships, cross-view dependencies, and temporal motion associations. We evaluate SVTformer on three popular 3D human pose estimation (HPE) datasets, where it yields state-of-the-art performance. It effectively deals with ill-posed problems and improves the accuracy of 3D human pose estimation.
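The idea of attending along one axis at a time (joints, then views, then frames) can be illustrated with a small NumPy sketch. This is not the authors' architecture: the abstract does not specify layer details, so the tensor layout `(V, T, J, C)`, the helper names `axis_attention` and `svt_block`, the use of single-head attention with identity projections, and the residual connections are all assumptions made purely for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Single-head self-attention along one axis of x, shape (V, T, J, C).

    Hypothetical helper: queries, keys, and values are the features
    themselves (no learned projections), to keep the sketch minimal.
    """
    # Bring the attended axis next to the channel axis: (..., N, C).
    seq = np.moveaxis(x, axis, -2)
    # Scaled dot-product scores between the N tokens: (..., N, N).
    scores = seq @ np.swapaxes(seq, -1, -2) / np.sqrt(seq.shape[-1])
    out = softmax(scores, axis=-1) @ seq
    # Restore the original (V, T, J, C) layout.
    return np.moveaxis(out, -2, axis)

def svt_block(x):
    """Assumed ordering: spatial, then view, then temporal attention."""
    x = x + axis_attention(x, axis=2)  # joints within each view and frame
    x = x + axis_attention(x, axis=0)  # views for each joint and frame
    x = x + axis_attention(x, axis=1)  # frames for each joint and view
    return x

rng = np.random.default_rng(0)
# 4 views, 9 frames, 17 joints, 32 channels (illustrative sizes only).
x = rng.standard_normal((4, 9, 17, 32))
y = svt_block(x)
print(y.shape)  # (4, 9, 17, 32)
```

Because each call mixes information along a single axis only, view attention at one frame is unaffected by the other frames; the sequential composition in `svt_block` is what lets information propagate across all three axes.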

Cite

Text

Zhang et al. "SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33101

Markdown

[Zhang et al. "SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhang2025aaai-svtformer/) doi:10.1609/AAAI.V39I10.33101

BibTeX

@inproceedings{zhang2025aaai-svtformer,
  title     = {{SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation}},
  author    = {Zhang, Wanruo and Liu, Mengyuan and Liu, Hong and Li, Wenhao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10148--10156},
  doi       = {10.1609/AAAI.V39I10.33101},
  url       = {https://mlanthology.org/aaai/2025/zhang2025aaai-svtformer/}
}