MotionChain: Conversational Motion Controllers via Multimodal Prompts
Abstract
Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations in controlling continuous virtual human movements, generative human motion models can achieve an intuitive and step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multimodal tokenizers that transform various data types, such as text, images, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.
Cite
Text
Jiang et al. "MotionChain: Conversational Motion Controllers via Multimodal Prompts." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73347-5_4
Markdown
[Jiang et al. "MotionChain: Conversational Motion Controllers via Multimodal Prompts." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/jiang2024eccv-motionchain/) doi:10.1007/978-3-031-73347-5_4
BibTeX
@inproceedings{jiang2024eccv-motionchain,
title = {{MotionChain: Conversational Motion Controllers via Multimodal Prompts}},
author = {Jiang, Biao and Chen, Xin and Zhang, Chi and Yin, Fukun and Li, Zhuoyuan and Yu, Gang and Fan, Jiayuan},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73347-5_4},
url = {https://mlanthology.org/eccv/2024/jiang2024eccv-motionchain/}
}