BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Abstract

Diffusion Transformers have shown remarkable ability in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation because they struggle to parse prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios, from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance in subject consistency, naturalness, and text relevance of the generated videos, outperforming existing open-source and commercial models.
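
The abstract does not say how the subject-aware hidden states are injected into the diffusion transformer. As a purely illustrative sketch, assuming the conditioning happens through cross-attention (a common choice, not confirmed by the source), the toy code below lets each video token attend over a set of conditioning vectors. The names `cross_attend`, `video_tokens`, and `cond_states` are hypothetical, and the learned projections, multiple heads, and normalization of a real DiT block are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(video_tokens, cond_states):
    """Single-head cross-attention without learned projections (assumption).

    video_tokens: list of query vectors, standing in for DiT video latents.
    cond_states:  list of vectors, standing in for MLLM hidden states.
    Returns each video token plus an attention-weighted mix of cond_states.
    """
    dim = len(cond_states[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in video_tokens:
        # Scaled dot-product score between this token and every conditioning state.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cond_states]
        weights = softmax(scores)
        # Weighted sum of the conditioning states (they act as keys and values here).
        mixed = [sum(w * state[d] for w, state in zip(weights, cond_states)) for d in range(dim)]
        # Residual connection: the token keeps its own content and adds the condition.
        updated.append([q + m for q, m in zip(query, mixed)])
    return updated


# Toy inputs: two conditioning states and three video tokens of width 2.
cond_states = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
updated = cross_attend(video_tokens, cond_states)
```

In this toy setup, a token aligned with one conditioning state draws most of its update from that state, while the all-zero token receives an even mix of both. That is the basic mechanism by which a conditioning signal can steer individual tokens toward specific subjects.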

Cite

Text

Li et al. "BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-bindweave/)

BibTeX

@inproceedings{li2026iclr-bindweave,
  title     = {{BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration}},
  author    = {Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-bindweave/}
}