DISTFormer: Enhance 3D Human Pose Estimation via Dual Inverse-Order Spatial-Temporal Transformer

Li, Ruidong; Huo, Hua

doi:10.1007/S10994-025-06938-3

MLJ 2026 pp. 8

Abstract

Transformer-based methods have significantly improved 3D human pose estimation and set new benchmarks in the field. In this paper, we aim to capture both the spatial relations among human joints within each video frame and the human dynamics across frames, and we propose a novel Dual Inverse-Order Spatial-Temporal Transformer (DIST) block. Each processing stream of the block incorporates two complementary components: an Enhanced Temporal Multi-head Self-Attention (ET-MHSA) module for modeling temporal context and an Enhanced Spatial Multi-head Self-Attention (ES-MHSA) module for modeling spatial dependencies between joints. We construct DISTFormer by stacking multiple DIST blocks and further incorporate a novel Spatial-Temporal Enhanced Positional Embedding (ST-EPE), which extracts features from the perspectives of both spatial structure and temporal position and thereby facilitates in-depth interaction of spatio-temporal features within the model. In addition, we integrate joint grouping information into the regression head to ensure the anatomical plausibility and accuracy of pose estimation. Extensive experiments on the Human3.6M and MPI-INF-3DHP benchmark datasets show that our model outperforms multiple state-of-the-art methods while using significantly fewer parameters. On the Human3.6M dataset, compared to the baseline STCFormer, our approach reduces P1 error by 0.2 mm while reducing the parameter count by 29.1%.
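
For readers unfamiliar with the metric, "P1 error" on Human3.6M conventionally refers to the mean per-joint position error (MPJPE) under Protocol 1, reported in millimetres. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `mpjpe` and the three-joint poses are hypothetical, and the sketch assumes both poses are already expressed relative to the same root joint.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error between two 3D poses.

    pred, target: sequences of (x, y, z) joint coordinates in millimetres,
    listed in the same joint order. Assumes both poses are already
    root-aligned (an assumption of this sketch).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance per joint, averaged over all joints.
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


# Hypothetical three-joint pose, for illustration only (not data from the paper).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (103.0, 4.0, 0.0), (100.0, 200.0, 10.0)]

error_mm = mpjpe(pred, gt)  # per-joint distances are 0, 5 and 10 mm
print(f"MPJPE: {error_mm:.1f} mm")
```

In practice the per-pose error is further averaged over all frames and test sequences, so a reported 0.2 mm difference is a difference between dataset-level averages.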

Cite

Text

Li and Huo. "DISTFormer: Enhance 3D Human Pose Estimation via Dual Inverse-Order Spatial-Temporal Transformer." Machine Learning, 2026. doi:10.1007/S10994-025-06938-3

Markdown

[Li and Huo. "DISTFormer: Enhance 3D Human Pose Estimation via Dual Inverse-Order Spatial-Temporal Transformer." Machine Learning, 2026.](https://mlanthology.org/mlj/2026/li2026mlj-distformer/) doi:10.1007/S10994-025-06938-3

BibTeX

@article{li2026mlj-distformer,
  title     = {{DISTFormer: Enhance 3D Human Pose Estimation via Dual Inverse-Order Spatial-Temporal Transformer}},
  author    = {Li, Ruidong and Huo, Hua},
  journal   = {Machine Learning},
  year      = {2026},
  pages     = {8},
  doi       = {10.1007/S10994-025-06938-3},
  volume    = {115},
  url       = {https://mlanthology.org/mlj/2026/li2026mlj-distformer/}
}