Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion
Abstract
Controllable person image synthesis aims at rendering a source image based on user-specified changes in body pose or appearance. Prior approaches leverage pixel-level denoising diffusion models conditioned on a coarse skeleton via cross-attention. This leads to two limitations: low efficiency and inaccurate conditioning information. To address both issues, we introduce a novel Pose-Constrained Latent Diffusion model (PoCoLD). Rather than using the skeleton as a sparse pose representation, we exploit DensePose, which offers much richer body-structure information. To capitalize on DensePose effectively and at low cost, we propose an efficient pose-constrained attention module capable of modeling the complex interplay between appearance and pose. Extensive experiments show that PoCoLD outperforms state-of-the-art competitors in image synthesis fidelity. Critically, it runs 2x faster and uses 3.6x less memory than the latest diffusion-model-based alternative during inference.
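To make the core idea concrete, below is a minimal sketch of what a pose-constrained cross-attention block could look like: queries come from the noisy latent tokens at the target pose, keys and values come from source-image appearance features, and the attention logits are biased by a correspondence map derived from DensePose. This is an illustrative assumption based on the abstract, not the authors' released implementation; all names (PoseConstrainedAttention, pose_bias) are hypothetical.

# Minimal sketch (PyTorch-style), assuming a cross-attention block whose
# logits are biased by a DensePose-derived correspondence map. Hypothetical
# names; not the authors' code.
import torch
import torch.nn as nn

class PoseConstrainedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries from noisy latent
        self.to_k = nn.Linear(dim, dim, bias=False)  # keys from appearance features
        self.to_v = nn.Linear(dim, dim, bias=False)  # values from appearance features
        self.proj = nn.Linear(dim, dim)

    def forward(self, latent, appearance, pose_bias):
        # latent:     (B, N_q, C) latent tokens at the target pose
        # appearance: (B, N_k, C) source-image appearance tokens
        # pose_bias:  (B, N_q, N_k) logit bias encoding the assumed DensePose
        #             body-surface correspondence between target and source
        B, Nq, C = latent.shape
        h, d = self.num_heads, C // self.num_heads
        q = self.to_q(latent).view(B, Nq, h, d).transpose(1, 2)
        k = self.to_k(appearance).view(B, -1, h, d).transpose(1, 2)
        v = self.to_v(appearance).view(B, -1, h, d).transpose(1, 2)
        # Bias the attention logits so each target position favors the
        # source positions that map to the same body surface point.
        attn = (q @ k.transpose(-2, -1)) * self.scale + pose_bias.unsqueeze(1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
        return self.proj(out)

# Usage example: a 16x16 latent grid (256 tokens) attending to 256 source tokens.
# x    = torch.randn(2, 256, 320)
# app  = torch.randn(2, 256, 320)
# bias = torch.zeros(2, 256, 256)   # e.g. large negative off-correspondence
# y = PoseConstrainedAttention(320)(x, app, bias)  # -> (2, 256, 320)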
Cite
Text
Han et al. "Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02081Markdown
[Han et al. "Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/han2023iccv-controllable/) doi:10.1109/ICCV51070.2023.02081BibTeX
@inproceedings{han2023iccv-controllable,
title = {{Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion}},
author = {Han, Xiao and Zhu, Xiatian and Deng, Jiankang and Song, Yi-Zhe and Xiang, Tao},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {22768--22777},
doi = {10.1109/ICCV51070.2023.02081},
url = {https://mlanthology.org/iccv/2023/han2023iccv-controllable/}
}