DiT4Edit: Diffusion Transformer for Image Editing

Abstract

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate a superior capability to capture long-range dependencies among patches, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared to the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patch merging tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially for high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit in various editing scenarios, highlighting the potential of diffusion transformers for image editing.
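The abstract contrasts DPM-Solver inversion with the DDIM inversion used in UNet-based editors. As background, the sketch below illustrates the deterministic DDIM-style inversion round trip (latent → noise → latent) that such editing pipelines rely on; it is not the paper's algorithm. The noise schedule is a toy one, and `eps_model` is a hypothetical placeholder for the learned noise predictor (a real editor would call the diffusion transformer here).

```python
import numpy as np

# Toy cumulative noise schedule (alpha-bar), T timesteps.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x, t):
    # Hypothetical stand-in for the learned noise predictor;
    # a real pipeline would query the diffusion model here.
    return np.zeros_like(x)

def ddim_step(x, t_from, t_to):
    """One deterministic DDIM-style update from t_from to t_to.
    t_to > t_from performs inversion (latent -> noise);
    t_to < t_from performs sampling (noise -> latent)."""
    a_from, a_to = alpha_bars[t_from], alpha_bars[t_to]
    eps = eps_model(x, t_from)
    # Predict the clean latent, then re-noise it to the target timestep.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

# Round trip: invert a latent up to t = T-1, then sample back down.
x = np.random.default_rng(0).standard_normal((4, 4))
z = x
for t in range(T - 1):          # inversion: 0 -> T-1
    z = ddim_step(z, t, t + 1)
for t in range(T - 1, 0, -1):   # sampling:  T-1 -> 0
    z = ddim_step(z, t, t - 1)

assert np.allclose(z, x)  # deterministic round trip recovers the latent
```

Because each update is deterministic, the inversion is exactly reversible; the practical cost is the number of model evaluations per trajectory, which is where a higher-order solver such as DPM-Solver reduces the step count.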

Cite

Text

Feng et al. "DiT4Edit: Diffusion Transformer for Image Editing." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32304

Markdown

[Feng et al. "DiT4Edit: Diffusion Transformer for Image Editing." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/feng2025aaai-dit/) doi:10.1609/AAAI.V39I3.32304

BibTeX

@inproceedings{feng2025aaai-dit,
  title     = {{DiT4Edit: Diffusion Transformer for Image Editing}},
  author    = {Feng, Kunyu and Ma, Yue and Wang, Bingyuan and Qi, Chenyang and Chen, Haozhe and Chen, Qifeng and Wang, Zeyu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2969--2977},
  doi       = {10.1609/AAAI.V39I3.32304},
  url       = {https://mlanthology.org/aaai/2025/feng2025aaai-dit/}
}