Learning Where It Matters: Responsible and Interpretable Text-to-Image Generation with Background Consistency

Abstract

Text-to-image diffusion models have achieved remarkable progress, yet they still struggle to produce unbiased and responsible outputs. A promising direction is to manipulate the bottleneck space of the U-Net (the $h$-space), which provides interpretability and controllability. However, existing methods rely on learning attributes from the entire image, entangling them with spurious features and offering no corrective mechanisms at inference. This uniform reliance leads to poor subject alignment, fairness issues, reduced photorealism, and incoherent backgrounds in scene-specific prompts. To address these challenges, we propose two complementary innovations for training and inference. First, we introduce a spatially focused concept learning framework that disentangles target attributes into concept vectors by suppressing target attribute features within the multi-head cross-attention (MCA) modules and attenuating the encoder output (i.e., $h$-vector) to ensure the concept vector exclusively captures target attribute features. In addition, we introduce a spatially weighted reconstruction loss to emphasize regions relevant to the target attribute. Second, we design an inference-time strategy that improves background consistency by enhancing low-frequency components in the $h$-space. Experiments demonstrate that our approach improves fairness, subject fidelity, and background coherence while preserving visual quality and prompt alignment, outperforming state-of-the-art $h$-space methods. The code is provided at https://github.com/Moslem-Sh21/learning-where-it-matters.
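
The abstract names two mechanisms without giving their exact form: a spatially weighted reconstruction loss and an inference-time enhancement of low-frequency components in the $h$-space. The NumPy sketch below illustrates one plausible reading of each and is not the authors' implementation. The function names, the weighting rule `1 + lam * mask`, and the parameters `lam`, `scale`, and `radius` are all assumptions made for illustration; the linked repository is the reference for the actual method.

```python
import numpy as np

def weighted_reconstruction_loss(pred, target, mask, lam=4.0):
    """Weighted MSE (hypothetical form): pixels inside `mask` count (1 + lam) times.

    pred, target: arrays of shape (C, H, W); mask: array of shape (H, W) in [0, 1].
    """
    weights = 1.0 + lam * mask                 # assumed weighting rule
    sq_err = (pred - target) ** 2              # per-element squared error
    # Normalize by the total weight so the loss stays on an MSE-like scale.
    return float((weights * sq_err).sum() / (weights.sum() * pred.shape[0]))

def boost_low_frequencies(h, scale=1.5, radius=1):
    """Multiply Fourier coefficients within `radius` of DC by `scale` (assumed filter).

    h: feature map of shape (C, H, W), standing in for a U-Net bottleneck tensor.
    """
    _, height, width = h.shape
    fy = np.fft.fftfreq(height) * height       # integer frequency indices (rows)
    fx = np.fft.fftfreq(width) * width         # integer frequency indices (cols)
    dist = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    gain = np.where(dist <= radius, scale, 1.0)  # symmetric gain keeps output real
    spectrum = np.fft.fft2(h, axes=(-2, -1))
    return np.fft.ifft2(spectrum * gain, axes=(-2, -1)).real

# Usage on random stand-in data (shapes are illustrative only).
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8, 8))
h_boosted = boost_low_frequencies(h, scale=1.5, radius=1)

mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                           # hypothetical attribute region
loss = weighted_reconstruction_loss(h_boosted, h, mask)
```

In this sketch the loss weights are normalized by their sum, so changing `lam` shifts emphasis toward the masked region without rescaling the overall loss. The frequency gain is applied per channel over the two spatial axes, and `scale=1.0` leaves the feature map unchanged.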

Cite

Text

Shokrolahi et al. "Learning Where It Matters: Responsible and Interpretable Text-to-Image Generation with Background Consistency." Transactions on Machine Learning Research, 2026.

Markdown

[Shokrolahi et al. "Learning Where It Matters: Responsible and Interpretable Text-to-Image Generation with Background Consistency." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/shokrolahi2026tmlr-learning/)

BibTeX

@article{shokrolahi2026tmlr-learning,
  title     = {{Learning Where It Matters: Responsible and Interpretable Text-to-Image Generation with Background Consistency}},
  author    = {Shokrolahi, Sayedmoslem and Kang, Jae-Mo and Kim, Il-Min},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/shokrolahi2026tmlr-learning/}
}