UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Abstract
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks. The code is available at https://github.com/liyiheng23/UniPose.
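To make the idea of discrete pose tokens in a unified vocabulary concrete, the following is a minimal sketch and not the authors' implementation. It assumes the tokenizer works by nearest-neighbor lookup in a learned codebook (a common vector-quantization scheme that the abstract does not confirm) and that pose token IDs are placed after the text token IDs. The names `TEXT_VOCAB_SIZE`, `nearest_code`, `pose_to_token_ids`, and `token_ids_to_features`, the vocabulary size, and the toy codebook are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical size of the LLM's text vocabulary (assumption, not from the paper).
TEXT_VOCAB_SIZE = 32000


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def pose_to_token_ids(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    offset: int = TEXT_VOCAB_SIZE,
) -> List[int]:
    """Quantize each pose feature vector and shift its index past the text vocabulary."""
    return [offset + nearest_code(f, codebook) for f in features]


def token_ids_to_features(
    token_ids: Sequence[int],
    codebook: Sequence[Sequence[float]],
    offset: int = TEXT_VOCAB_SIZE,
) -> List[List[float]]:
    """Map pose token IDs back to their codebook vectors (the quantized features)."""
    return [list(codebook[t - offset]) for t in token_ids]


# Toy example: a 3-entry codebook over 2-D features (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.1]]
token_ids = pose_to_token_ids(features, codebook)
recovered = token_ids_to_features(token_ids, codebook)
```

Under these assumptions, pose tokens and text tokens share one ID space, so a language model can read or emit them in the same sequence. In a real system the feature vectors would come from a learned pose encoder, and a decoder would map the quantized features back to SMPL pose parameters.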
Cite
Text
Li et al. "UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02589
Markdown
[Li et al. "UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-unipose/) doi:10.1109/CVPR52734.2025.02589
BibTeX
@inproceedings{li2025cvpr-unipose,
title = {{UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing}},
author = {Li, Yiheng and Hou, Ruibing and Chang, Hong and Shan, Shiguang and Chen, Xilin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {27805-27815},
doi = {10.1109/CVPR52734.2025.02589},
url = {https://mlanthology.org/cvpr/2025/li2025cvpr-unipose/}
}