MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation

Abstract

Recent advancements in text-to-motion generation models have shown impressive capabilities in creating high-fidelity motion sequences. However, generating desired sequences using only text prompts is challenging due to the complexity of prompt engineering. We present the Multimodal Conditional Representation and Editing (MCRE) module, a lightweight adapter for text-to-motion generation and editing. MCRE unifies text and motion conditions into the CLIP representation space, enabling precise and flexible multimodal control. Despite its simplicity, MCRE’s hybrid motion- and text-conditioned editing achieves performance comparable to or better than that of fully fine-tuned models. The highly disentangled CLIP representations enable flexible motion sequence editing by combining multiple conditions, resulting in versatile and high-quality motion generation and editing.

Cite

Text

Sun et al. "MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92591-7_26

Markdown

[Sun et al. "MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/sun2024eccvw-mcre/) doi:10.1007/978-3-031-92591-7_26

BibTeX

@inproceedings{sun2024eccvw-mcre,
  title     = {{MCRE: Multimodal Conditional Representation and Editing for Text-Motion Generation}},
  author    = {Sun, Tengjiao and Li, Xiang and Shi, Tianyu and Peng, Jiahui and Zheng, Sheng and Kim, Hansung},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {406--414},
  doi       = {10.1007/978-3-031-92591-7_26},
  url       = {https://mlanthology.org/eccvw/2024/sun2024eccvw-mcre/}
}