VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Cha, SeungJu; Lee, Kwanyoung; Kim, Ye-Chan; Oh, Hyunwoo; Kim, Dong-Jin

doi:10.1109/CVPR52734.2025.00753

CVPR 2025 pp. 8041-8050

Abstract

Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.

Cite

Text

Cha et al. "VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00753

Markdown

[Cha et al. "VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/cha2025cvpr-verbdiff/) doi:10.1109/CVPR52734.2025.00753

BibTeX

@inproceedings{cha2025cvpr-verbdiff,
  title     = {{VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness}},
  author    = {Cha, SeungJu and Lee, Kwanyoung and Kim, Ye-Chan and Oh, Hyunwoo and Kim, Dong-Jin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {8041--8050},
  doi       = {10.1109/CVPR52734.2025.00753},
  url       = {https://mlanthology.org/cvpr/2025/cha2025cvpr-verbdiff/}
}