EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization

Abstract

Recent work on diffusion models has focused on handling conditional generation tasks efficiently, without extra training. These approaches decompose the result into two components: (1) an unconditional sample, generated in the absence of conditions, and (2) a condition correction, which adjusts the unconditional sample to agree with the guidance image. The correction is quantified by a pixel-level measure: the latent is decoded back into a pixel image, and a forward operator translates the noisy image into the guidance domain for comparison with the guidance image. To improve the fidelity of the condition correction, we propose a learnable latent forward operator that enforces consistency in the latent space, with the expectation that latent-space consistency approximates the pixel-level fidelity measure. An encoder translates the guidance image into the latent space, and a correctional operator rectifies model mismatch in the latent guidance model. Determining the condition term and estimating the correction together amount to solving a blind inverse problem. EMControl uses the Expectation-Maximization (EM) algorithm to solve this blind inverse problem during the reverse sampling process. This ensures that samples, once consistent with the guidance, are accurately mapped back onto the noisy data manifold and adhere to the data's inherent distribution. EMControl delivers superior performance on conditional diffusion generation tasks compared to previous approaches, and its application to multiple-condition scenarios demonstrates its versatility and robustness across a range of generative tasks.
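
The two-component decomposition described in the abstract (unconditional sample plus condition correction) can be illustrated with a toy sketch. This is not the EMControl method: it assumes a fixed, known linear forward operator `A` and a plain squared-error measure, and it leaves out the diffusion model, the latent encoder, the learnable operator, and the EM steps entirely. All names (`guided_update`, `A`, `y`, `x_uncond`, `step`) are hypothetical and chosen only for illustration.

```python
# Toy illustration, NOT the paper's algorithm: one "condition correction"
# applied to a stand-in "unconditional sample" under an assumed linear
# forward operator A and the measure 0.5 * ||A x - y||^2.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def residual_norm(A, x, y):
    """Euclidean norm of A x - y, the mismatch with the guidance signal."""
    return sum((a - b) ** 2 for a, b in zip(matvec(A, x), y)) ** 0.5

def guided_update(x_uncond, y, A, step):
    """Return x_uncond plus a correction: one gradient step on 0.5 * ||A x - y||^2."""
    residual = [a - b for a, b in zip(matvec(A, x_uncond), y)]
    grad = matvec(transpose(A), residual)          # A^T (A x - y)
    return [x - step * g for x, g in zip(x_uncond, grad)]

# Hypothetical forward operator mapping a 3-vector to a 2-vector "guidance domain".
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
y = [1.0, 2.0]               # stand-in for the guidance signal
x_uncond = [0.0, 0.0, 0.0]   # stand-in for the unconditional sample

before = residual_norm(A, x_uncond, y)
x_new = guided_update(x_uncond, y, A, step=0.5)
after = residual_norm(A, x_new, y)

print(x_new)            # corrected sample
print(before, after)    # mismatch before and after the correction
```

In this sketch the correction reduces the mismatch with the guidance signal in a single step. The abstract's blind setting differs in that the operator itself is uncertain and must be estimated alongside the sample, which is where the EM alternation comes in.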

Cite

Text

Wang et al. "EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I7.32828

Markdown

[Wang et al. "EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/wang2025aaai-emcontrol/) doi:10.1609/AAAI.V39I7.32828

BibTeX

@inproceedings{wang2025aaai-emcontrol,
  title     = {{EMControl: Adding Conditional Control to Text-to-Image Diffusion Models via Expectation-Maximization}},
  author    = {Wang, He and Dai, Longquan and Tang, Jinhui},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {7691--7699},
  doi       = {10.1609/AAAI.V39I7.32828},
  url       = {https://mlanthology.org/aaai/2025/wang2025aaai-emcontrol/}
}