3D Human Pose Estimation via Non-Causal Retentive Networks
Abstract
Temporal dependencies are essential in 3D human pose estimation to mitigate depth ambiguity. Previous methods typically use a fixed-length sliding window to capture these dependencies. However, they treat past and future frames equally, ignoring the fact that relying on too many future frames increases the inference latency. In this paper, we present a 3D human pose estimation model based on Retentive Networks (RetNet) that incorporates temporal information by utilizing a large number of past frames and a few future frames. The Non-Causal RetNet (NC-RetNet) is designed to allow the originally causal RetNet to be aware of future information. Additionally, we propose a knowledge transfer strategy, i.e., training the model with a larger chunk size and using a smaller chunk size during inference, to reduce latency while maintaining comparable accuracy. Extensive experiments have been conducted on the Human3.6M and MPI-INF-3DHP datasets, and the results demonstrate that our method achieves state-of-the-art performance. Code and models are available at https://github.com/Kelly510/PoseRetN
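The abstract's knowledge-transfer strategy (train with a large chunk size, infer with a small one) builds on RetNet's chunk-wise recurrent form, where a running state carries information across chunks. As a point of reference, here is a minimal NumPy sketch of *causal* chunk-wise retention with a single head and scalar decay; for the causal case the chunk size does not change the output, which is exactly what makes it a flexible inference-time knob (the paper's NC-RetNet extends this mechanism to a few future frames, where chunk size then trades latency against accuracy). All function and variable names here are illustrative, not taken from the authors' code.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Causal retention, parallel form: O[n] = sum_{m<=n} gamma^(n-m) (Q[n]·K[m]) V[m]."""
    T = Q.shape[0]
    n = np.arange(T)
    # decay mask: gamma^(n-m) below the diagonal, 0 above (no future frames)
    D = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_chunkwise(Q, K, V, gamma, chunk):
    """Same computation, processed chunk by chunk with a recurrent cross-chunk state."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))           # state: decayed sum of k^T v over past chunks
    out = np.zeros((T, V.shape[1]))
    for s in range(0, T, chunk):
        q, k, v = Q[s:s + chunk], K[s:s + chunk], V[s:s + chunk]
        L = q.shape[0]
        i = np.arange(L)
        D = np.where(i[:, None] >= i[None, :], gamma ** (i[:, None] - i[None, :]), 0.0)
        inner = (q @ k.T * D) @ v                      # within-chunk attention
        cross = (gamma ** (i + 1))[:, None] * (q @ S)  # contribution of all past chunks
        out[s:s + chunk] = inner + cross
        # roll the state forward: decay the old state, fold in this chunk
        decay = gamma ** (L - 1 - i)
        S = (gamma ** L) * S + (k * decay[:, None]).T @ v
    return out
```

Running both forms on the same inputs with different chunk sizes yields identical outputs (up to floating-point error), illustrating why chunk size can be changed between training and inference without altering what the causal recurrence computes.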
Cite
Text
Zheng et al. "3D Human Pose Estimation via Non-Causal Retentive Networks." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73414-4_7
Markdown
[Zheng et al. "3D Human Pose Estimation via Non-Causal Retentive Networks." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zheng2024eccv-3d/) doi:10.1007/978-3-031-73414-4_7
BibTeX
@inproceedings{zheng2024eccv-3d,
  title     = {{3D Human Pose Estimation via Non-Causal Retentive Networks}},
  author    = {Zheng, Kaili and Lu, Feixiang and Lv, Yihao and Zhang, Liangjun and Guo, Chenyi and Wu, Ji},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73414-4_7},
  url       = {https://mlanthology.org/eccv/2024/zheng2024eccv-3d/}
}