Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion

Abstract

Controllable person image synthesis aims to render a source image according to user-specified changes in body pose or appearance. Prior approaches leverage pixel-level denoising diffusion models conditioned on a coarse skeleton via cross-attention, a design that suffers from two limitations: low efficiency and imprecise pose conditioning. To address both issues, a novel Pose-Constrained Latent Diffusion model (PoCoLD) is introduced. Rather than using the skeleton as a sparse pose representation, we exploit DensePose, which offers much richer body structure information. To capitalize on DensePose effectively at low cost, we propose an efficient pose-constrained attention module capable of modeling the complex interplay between appearance and pose. Extensive experiments show that PoCoLD outperforms state-of-the-art competitors in image synthesis fidelity. Critically, it runs 2x faster and consumes 3.6x less memory than the latest diffusion-model-based alternative during inference.
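The abstract does not describe the pose-constrained attention module in detail, but the following is a minimal PyTorch sketch of one plausible reading: DensePose maps for the source and target images are assumed to be encoded into per-token pose embeddings whose pairwise similarity biases the cross-attention between the noisy target latent and the source appearance features. All names, shapes, and the additive-bias formulation here are illustrative assumptions rather than the paper's actual implementation.

# Hypothetical sketch: pose-constrained cross-attention in a latent diffusion UNet.
# Not the paper's code; shapes, names, and the additive pose bias are assumptions.
import torch
import torch.nn as nn

class PoseConstrainedAttention(nn.Module):
    def __init__(self, dim: int, pose_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)       # queries: noisy target latent tokens
        self.to_kv = nn.Linear(dim, dim * 2, bias=False)  # keys/values: source appearance tokens
        self.pose_proj = nn.Linear(pose_dim, dim, bias=False)  # shared DensePose embedding
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latent, appearance, tgt_pose, src_pose):
        # latent: (B, N, dim), appearance: (B, M, dim)
        # tgt_pose: (B, N, pose_dim), src_pose: (B, M, pose_dim)
        B = latent.shape[0]
        h = self.num_heads
        q = self.to_q(latent)
        k, v = self.to_kv(appearance).chunk(2, dim=-1)

        # Pose constraint: target/source token pairs whose DensePose embeddings
        # agree receive a larger additive bias on the attention logits.
        pose_bias = torch.einsum(
            "bnd,bmd->bnm", self.pose_proj(tgt_pose), self.pose_proj(src_pose)
        ) * self.scale

        def split(t):  # (B, T, dim) -> (B, h, T, dim // h)
            return t.view(B, -1, h, t.shape[-1] // h).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale + pose_bias.unsqueeze(1)
        out = attn.softmax(dim=-1) @ v                    # (B, h, N, dim // h)
        out = out.transpose(1, 2).reshape(B, -1, h * out.shape[-1])
        return self.to_out(out)

# Usage with made-up sizes: 64 latent tokens, 320-dim features, 64-dim pose codes.
block = PoseConstrainedAttention(dim=320, pose_dim=64)
y = block(torch.randn(2, 64, 320), torch.randn(2, 64, 320),
          torch.randn(2, 64, 64), torch.randn(2, 64, 64))  # -> (2, 64, 320)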

Cite

Text

Han et al. "Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02081

Markdown

[Han et al. "Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/han2023iccv-controllable/) doi:10.1109/ICCV51070.2023.02081

BibTeX

@inproceedings{han2023iccv-controllable,
  title     = {{Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion}},
  author    = {Han, Xiao and Zhu, Xiatian and Deng, Jiankang and Song, Yi-Zhe and Xiang, Tao},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {22768--22777},
  doi       = {10.1109/ICCV51070.2023.02081},
  url       = {https://mlanthology.org/iccv/2023/han2023iccv-controllable/}
}