Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-Based Human Image Generation

Abstract

Vanilla text-to-image diffusion models struggle to generate accurate human images, commonly producing imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding extra controls (human-centric priors such as pose or depth maps) during the image generation phase. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, based on an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts.

Cite

Text

Wang et al. "Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-Based Human Image Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00807

Markdown

[Wang et al. "Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-Based Human Image Generation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wang2024cvpr-effective/) doi:10.1109/CVPR52733.2024.00807

BibTeX

@inproceedings{wang2024cvpr-effective,
  title     = {{Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-Based Human Image Generation}},
  author    = {Wang, Junyan and Sun, Zhenhong and Tan, Zhiyu and Chen, Xuanbai and Chen, Weihua and Li, Hao and Zhang, Cheng and Song, Yang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {8446--8455},
  doi       = {10.1109/CVPR52733.2024.00807},
  url       = {https://mlanthology.org/cvpr/2024/wang2024cvpr-effective/}
}