Enabling ControlNet to Follow Localized Descriptions Using Cross-Attention Control
Abstract
ControlNet enables fine-grained control over image layout in prominent generators such as Stable Diffusion. However, it cannot take localized textual descriptions into account, i.e., annotations indicating which image region is described by which phrase in the prompt. In this work, we enable ControlNet to use localized descriptions with a training-free approach that modifies the cross-attention scores during generation. To this end, we adapt and investigate several existing cross-attention control methods and identify shortcomings that cause failures or image degradation under certain conditions. To address these shortcomings, we develop a novel cross-attention manipulation method. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed augmented ControlNet.
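The abstract's core idea of "modifying the cross-attention scores during generation" can be illustrated in general terms. The sketch below is not the paper's exact method; it shows the common baseline of region-masked attention biasing, where a large negative bias is added to the attention logits between a phrase's tokens and image positions outside that phrase's region, before the softmax. All names and the bias value are illustrative assumptions.

```python
import numpy as np

def localized_cross_attention(scores, token_region_mask, bias=-1e4):
    """Illustrative region-masked cross-attention (not the paper's method).

    scores:            (num_pixels, num_tokens) raw attention logits between
                       image positions (queries) and prompt tokens (keys).
    token_region_mask: (num_pixels, num_tokens) boolean; True where pixel i
                       lies inside the region described by token j (tokens
                       without a localized description are True everywhere).
    bias:              large negative value added to disallowed pairs.
    """
    # Suppress attention from a pixel to tokens whose region excludes it.
    biased = np.where(token_region_mask, scores, scores + bias)
    # Softmax over the token axis: each pixel's attention distribution.
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: two pixels, two tokens, uniform logits; each token's phrase
# describes exactly one pixel's region.
scores = np.zeros((2, 2))
mask = np.array([[True, False],
                 [False, True]])
attn = localized_cross_attention(scores, mask)
```

In this toy case each pixel attends almost entirely to the token whose region contains it; in a real pipeline the bias would be applied inside the cross-attention layers of the denoising network at each sampling step.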
Cite
Text
Lukovnikov and Fischer. "Enabling ControlNet to Follow Localized Descriptions Using Cross-Attention Control." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025. doi:10.1007/978-3-032-05981-9_19
Markdown
[Lukovnikov and Fischer. "Enabling ControlNet to Follow Localized Descriptions Using Cross-Attention Control." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025.](https://mlanthology.org/ecmlpkdd/2025/lukovnikov2025ecmlpkdd-enabling/) doi:10.1007/978-3-032-05981-9_19
BibTeX
@inproceedings{lukovnikov2025ecmlpkdd-enabling,
title = {{Enabling ControlNet to Follow Localized Descriptions Using Cross-Attention Control}},
author = {Lukovnikov, Denis and Fischer, Asja},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2025},
pages = {310--327},
doi = {10.1007/978-3-032-05981-9_19},
url = {https://mlanthology.org/ecmlpkdd/2025/lukovnikov2025ecmlpkdd-enabling/}
}