MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
Abstract
Despite the success of large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects or characters in different views or poses. Meanwhile, existing editing methods either fail to perform effective complex non-rigid edits while preserving the overall texture and identity, or require time-consuming fine-tuning to capture image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method that achieves both consistent image generation and complex non-rigid image editing. Specifically, MasaCtrl converts the existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local content and textures from source images to maintain consistency. To further alleviate query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that MasaCtrl produces impressive results in both consistent image generation and complex non-rigid real image editing.
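The abstract describes mutual self-attention only at a high level. The following is a minimal NumPy sketch under our own assumptions, not the authors' code. The target image's queries attend to keys and values computed from the source image. In the mask-guided variant, foreground queries are restricted to source foreground keys and background queries to source background keys, with each mask obtained by thresholding a normalized cross-attention map. The function names, the 0.5 threshold, and the single-head, unbatched shapes are illustrative assumptions. In an actual diffusion U-Net, this substitution would take place inside selected self-attention layers and denoising steps.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, key_mask=None):
    """Single-head scaled dot-product attention; q: (Nq, d), k and v: (Nk, d).

    key_mask is an optional (Nk,) 0/1 array; keys with 0 receive a large
    negative score, so their softmax weight underflows to zero.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if key_mask is not None:
        scores = np.where(key_mask[None, :] > 0, scores, -1e9)
    return softmax(scores, axis=-1) @ v


def mutual_self_attention(q_tgt, k_src, v_src):
    """Assumed form: target queries read content from the source keys/values."""
    return attention(q_tgt, k_src, v_src)


def mask_from_cross_attention(cross_attn_map, threshold=0.5):
    """Hypothetical helper: min-max normalize one token's cross-attention map, then threshold."""
    m = cross_attn_map - cross_attn_map.min()
    m = m / (m.max() + 1e-8)
    return (m > threshold).astype(np.float64)


def masked_mutual_self_attention(q_tgt, k_src, v_src, mask_src, mask_tgt):
    """Assumed mask guidance: foreground reads source foreground, background reads background."""
    out_fg = attention(q_tgt, k_src, v_src, key_mask=mask_src)
    out_bg = attention(q_tgt, k_src, v_src, key_mask=1.0 - mask_src)
    m = mask_tgt[:, None]  # blend per target token using the target mask
    return m * out_fg + (1.0 - m) * out_bg


rng = np.random.default_rng(0)
n, d = 16, 8  # 16 spatial tokens (e.g. a 4x4 latent), 8-dim features
q_tgt = rng.standard_normal((n, d))
k_src = rng.standard_normal((n, d))
v_src = rng.standard_normal((n, d))
mask_src = mask_from_cross_attention(rng.random(n))
mask_tgt = mask_from_cross_attention(rng.random(n))
out = masked_mutual_self_attention(q_tgt, k_src, v_src, mask_src, mask_tgt)
print(out.shape)  # (16, 8)
```

Because the keys are restricted by the source mask and the outputs are blended with the target mask, each target region draws only on its matching source region. In this sketch, that is how foreground textures are kept from leaking into the background, and vice versa.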
Cite
Cao et al. "MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02062
[Cao et al. "MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/cao2023iccv-masactrl/) doi:10.1109/ICCV51070.2023.02062
@inproceedings{cao2023iccv-masactrl,
title = {{MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing}},
author = {Cao, Mingdeng and Wang, Xintao and Qi, Zhongang and Shan, Ying and Qie, Xiaohu and Zheng, Yinqiang},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {22560-22570},
doi = {10.1109/ICCV51070.2023.02062},
url = {https://mlanthology.org/iccv/2023/cao2023iccv-masactrl/}
}