StyleMotif: Multi-Modal Motion Stylization Using Style-Content Cross Fusion
Abstract
We present StyleMotif, a novel Stylized Motion Latent Diffusion model that generates motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Project Page: https://stylemotif.github.io.
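The abstract names a style-content cross fusion mechanism but does not detail it on this page. Below is a minimal, hypothetical sketch of one common way such fusion is realized: cross-attention in which content motion latents query style tokens produced by a style encoder aligned with a shared multi-modal embedding space. The class name `StyleContentCrossFusion`, the tensor shapes, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of style-content cross fusion. Content latents (from a
# latent diffusion denoiser) attend to style tokens from a multi-modal style
# encoder. Names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn

class StyleContentCrossFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Content features act as queries; style tokens supply keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, num_frames, dim) motion latent sequence
        # style:   (batch, num_style_tokens, dim) multi-modal style embedding
        fused, _ = self.cross_attn(query=content, key=style, value=style)
        # Residual connection preserves content structure; style modulates it.
        return self.norm(content + fused)

# Usage: fuse a pooled style token (derived from motion, text, image, video,
# or audio via a shared encoder) into a motion latent sequence.
fusion = StyleContentCrossFusion(dim=256)
content = torch.randn(2, 196, 256)  # motion latents
style = torch.randn(2, 1, 256)      # one pooled style token per sample
out = fusion(content, style)
print(out.shape)  # torch.Size([2, 196, 256])
```

Because the style encoder is aligned with a pre-trained multi-modal model, a block like this could, in principle, accept style tokens from any of the supported modalities without retraining the fusion itself; that property is consistent with the emergent multi-modal stylization the abstract reports.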
Cite
Text
Guo et al. "StyleMotif: Multi-Modal Motion Stylization Using Style-Content Cross Fusion." International Conference on Computer Vision, 2025.
Markdown
[Guo et al. "StyleMotif: Multi-Modal Motion Stylization Using Style-Content Cross Fusion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/guo2025iccv-stylemotif/)
BibTeX
@inproceedings{guo2025iccv-stylemotif,
title = {{StyleMotif: Multi-Modal Motion Stylization Using Style-Content Cross Fusion}},
author = {Guo, Ziyu and Lee, Young Yoon and Liu, Joseph and Ben-Shabat, Yizhak and Zordan, Victor and Kapadia, Mubbasir},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {13349--13359},
url = {https://mlanthology.org/iccv/2025/guo2025iccv-stylemotif/}
}