MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation
Abstract
Recent advancements in text-to-motion generation models have shown impressive capabilities in creating high-fidelity motion sequences. However, generating desired sequences using only text prompts is challenging due to the complexity of prompt engineering. We present the Multimodal Conditional Representation and Editing (MCRE) module, a lightweight adapter for text-to-motion generation and editing. MCRE unifies text and motion conditions into the CLIP representation space, enabling precise and flexible multimodal control. Despite its simplicity, MCRE's hybrid motion- and text-conditioned editing achieves performance comparable to or better than that of fully fine-tuned models. The highly disentangled CLIP representations enable flexible motion sequence editing by combining multiple conditions, resulting in versatile and high-quality motion generation and editing.
Cite
Text
Sun et al. "MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92591-7_26
Markdown
[Sun et al. "MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/sun2024eccvw-mcre/) doi:10.1007/978-3-031-92591-7_26
BibTeX
@inproceedings{sun2024eccvw-mcre,
title = {{MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation}},
author = {Sun, Tengjiao and Li, Xiang and Shi, Tianyu and Peng, Jiahui and Zheng, Sheng and Kim, Hansung},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {406--414},
doi = {10.1007/978-3-031-92591-7_26},
url = {https://mlanthology.org/eccvw/2024/sun2024eccvw-mcre/}
}