FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
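The fusion mechanism the abstract describes, concatenating condition tokens from several modalities with the video tokens into one sequence and running full self-attention over it, can be illustrated with a minimal sketch. This is not the paper's implementation: the modality names, token values, and the unweighted single-head attention (Q = K = V) are illustrative assumptions, kept dependency-free for clarity.

```python
import math

def full_attention(tokens, d):
    # Naive single-head full self-attention: every token attends to every
    # other token in the sequence. Projection weights are omitted
    # (Q = K = V = tokens) purely for illustration.
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        m = max(scores)  # subtract max for numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Each output token is a convex combination of all input tokens.
        out.append([sum(w * k[j] for w, k in zip(weights, tokens))
                    for j in range(d)])
    return out

# Hypothetical token streams for each condition modality (names and
# values are placeholders, not the paper's actual tokenizers).
d = 4
text_tokens   = [[0.1] * d, [0.2] * d]
camera_tokens = [[0.3] * d]
depth_tokens  = [[0.4] * d, [0.5] * d]
video_tokens  = [[0.0] * d for _ in range(3)]

# FullDiT-style fusion: concatenate all condition tokens with the video
# tokens into one unified sequence, then apply full self-attention so
# condition dynamics and video content interact in a single pass.
sequence = text_tokens + camera_tokens + depth_tokens + video_tokens
fused = full_attention(sequence, d)
assert len(fused) == len(sequence)  # one output token per input token
```

Because all modalities share one attention pass rather than separate adapter branches, no per-condition parameters are added and cross-condition interactions are learned directly, which is the property the abstract attributes to the unified full-attention design.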

Cite

Text

Ju et al. "FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention." International Conference on Computer Vision, 2025.

Markdown

[Ju et al. "FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ju2025iccv-fulldit/)

BibTeX

@inproceedings{ju2025iccv-fulldit,
  title     = {{FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention}},
  author    = {Ju, Xuan and Ye, Weicai and Liu, Quande and Wang, Qiulin and Wang, Xintao and Wan, Pengfei and Zhang, Di and Gai, Kun and Xu, Qiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {15737--15747},
  url       = {https://mlanthology.org/iccv/2025/ju2025iccv-fulldit/}
}