Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong

ICCV 2025 pp. 5934-5943


Abstract

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve text-image alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
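The two ideas named in the abstract, temperature scaling of cross-modal attention and a timestep-dependent switch, can be pictured with a toy sketch. It is not the authors' implementation. The function names, the `gamma` and `t_thresh` values, the choice to scale only image-query to text-key logits, and the convention that `t` near 1 means an early, noisy step are all assumptions made for illustration; the abstract does not give the exact formulation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def taca_style_attention(logits, n_text, t, gamma=1.5, t_thresh=0.5):
    """Row-wise softmax over joint [text; image] attention logits.

    logits   : square list of lists; rows are queries, columns are keys,
               and the first n_text positions are text tokens.
    t        : normalized timestep in [0, 1]; 1 is assumed to be pure noise.
    gamma    : hypothetical temperature factor applied to cross-modal logits.
    t_thresh : hypothetical timestep above which the factor is active.
    """
    scale = gamma if t >= t_thresh else 1.0
    weights = []
    for q, row in enumerate(logits):
        is_image_query = q >= n_text
        # Scale only the logits where an image query attends to a text key.
        adjusted = [
            x * scale if (is_image_query and k < n_text) else x
            for k, x in enumerate(row)
        ]
        weights.append(softmax(adjusted))
    return weights


def text_mass(weights, n_text):
    """Average attention mass that image queries place on text keys."""
    image_rows = weights[n_text:]
    return sum(sum(row[:n_text]) for row in image_rows) / len(image_rows)


if __name__ == "__main__":
    n_text, n_image = 2, 6  # far fewer text tokens than image tokens
    n = n_text + n_image
    logits = [[1.0] * n for _ in range(n)]  # uniform toy logits
    early = taca_style_attention(logits, n_text, t=0.9)
    late = taca_style_attention(logits, n_text, t=0.1)
    print("early-step text mass:", text_mass(early, n_text))
    print("late-step text mass:", text_mass(late, n_text))
```

With uniform logits, the few text tokens receive only a small share of each image query's attention simply because they are outnumbered, which is the token-imbalance effect the abstract describes. Multiplying the cross-modal logits by a factor above 1 raises that share in this toy setting. Note that multiplying by `gamma` sharpens the distribution, so it boosts text keys only where their logits are positive.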


Cite

Text

Lv et al. "Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers." International Conference on Computer Vision, 2025.

Markdown

[Lv et al. "Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/lv2025iccv-rethinking/)

BibTeX

@inproceedings{lv2025iccv-rethinking,
  title     = {{Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers}},
  author    = {Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {5934--5943},
  url       = {https://mlanthology.org/iccv/2025/lv2025iccv-rethinking/}
}