MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Abstract

Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. Adaptation methods have been developed for customizing appearance, such as subject or style, yet remain under-explored for motion. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning and Low-Rank Adaptations (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing the appearance of one video with the motion of another, and animating a single image with customized motions. The project website is at: MotionDirector.
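The abstract's two key ideas, a dual-path LoRA design that separates appearance from motion and an appearance-debiased temporal loss, can be illustrated with a minimal PyTorch sketch. The names and constants below (`LoRALinear`, `appearance_debiased_loss`, the debias weight `beta`, the choice of anchor frame) are illustrative assumptions, not the authors' implementation, and the exact debiasing formula in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Generic low-rank adapter on top of a frozen linear layer (sketch, not the authors' code)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen; only the adapter trains
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


def appearance_debiased_loss(pred_noise, true_noise, beta: float = 1.0, anchor: int = 0):
    """Temporal loss with a shared-appearance bias removed (illustrative sketch).

    pred_noise, true_noise: (batch, frames, ...) noise tensors from the denoising objective.
    An anchor frame, scaled by beta, is subtracted from every frame before the MSE, so the
    loss emphasizes frame-to-frame differences (motion) rather than content shared by all frames.
    """
    def debias(eps):
        anchor_eps = eps[:, anchor:anchor + 1]
        return (eps - beta * anchor_eps) / (1.0 + beta ** 2) ** 0.5

    return F.mse_loss(debias(pred_noise), debias(true_noise))


# Dual-path idea (conceptual): spatial LoRAs are trained on single frames with the usual
# denoising loss to absorb appearance, while temporal LoRAs are trained on the full clip
# with the debiased loss to capture motion; at inference only the temporal LoRAs are injected.
```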

Cite

Text

Zhao et al. "MotionDirector: Motion Customization of Text-to-Video Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72992-8_16

Markdown

[Zhao et al. "MotionDirector: Motion Customization of Text-to-Video Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhao2024eccv-motiondirector/) doi:10.1007/978-3-031-72992-8_16

BibTeX

@inproceedings{zhao2024eccv-motiondirector,
  title     = {{MotionDirector: Motion Customization of Text-to-Video Diffusion Models}},
  author    = {Zhao, Rui and Gu, Yuchao and Wu, Jay Zhangjie and Zhang, David Junhao and Liu, Jia-Wei and Wu, Weijia and Keppo, Jussi and Shou, Mike Zheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72992-8_16},
  url       = {https://mlanthology.org/eccv/2024/zhao2024eccv-motiondirector/}
}