Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing

Abstract

Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
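To make the four-block decomposition concrete, the following is a minimal single-head sketch in NumPy, not the authors' implementation. It assumes text tokens are placed before image tokens in the concatenated sequence (the order differs between MM-DiT variants), uses separate query/key projections per modality, and labels each block by which modality supplies the queries and which supplies the keys. The function name `joint_attention_blocks`, the block names, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention_blocks(text, image, proj_text, proj_image):
    """Joint attention over concatenated text and image tokens.

    text:  (n_text, d) text token features
    image: (n_image, d) image token features
    proj_text, proj_image: (W_q, W_k) pairs, each of shape (d, d_k)

    Returns the full attention matrix and its four sub-blocks.
    Assumption: text tokens come first in the joint sequence.
    """
    n_text = text.shape[0]

    # Project each modality with its own weights, then concatenate.
    q = np.concatenate([text @ proj_text[0], image @ proj_image[0]], axis=0)
    k = np.concatenate([text @ proj_text[1], image @ proj_image[1]], axis=0)

    # One full attention operation over the joint sequence.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

    # Rows index queries, columns index keys.
    blocks = {
        "q_text_k_text": attn[:n_text, :n_text],
        "q_text_k_image": attn[:n_text, n_text:],
        "q_image_k_text": attn[n_text:, :n_text],
        "q_image_k_image": attn[n_text:, n_text:],
    }
    return attn, blocks


# Toy usage with random features: 3 text tokens, 5 image tokens.
rng = np.random.default_rng(0)
d, d_k = 8, 4
text = rng.normal(size=(3, d))
image = rng.normal(size=(5, d))
proj_text = (rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)))
proj_image = (rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)))
attn, blocks = joint_attention_blocks(text, image, proj_text, proj_image)
```

Because the softmax is taken over the whole joint row, each image query distributes its attention across both text and image keys, and each text query does the same. This is the bidirectional flow that a unidirectional cross-attention layer, where only image queries attend to text keys, does not have.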

Cite

Text

Shin et al. "Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing." International Conference on Computer Vision, 2025.

Markdown

[Shin et al. "Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/shin2025iccv-exploring/)

BibTeX

@inproceedings{shin2025iccv-exploring,
  title     = {{Exploring Multimodal Diffusion Transformers for Enhanced Prompt-Based Image Editing}},
  author    = {Shin, Joonghyuk and Hwang, Alchan and Kim, Yujin and Kim, Daneul and Park, Jaesik},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {19492--19502},
  url       = {https://mlanthology.org/iccv/2025/shin2025iccv-exploring/}
}