Visual Relation Diffusion for Human-Object Interaction Detection
Abstract
Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.
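To make the conditioning idea described above concrete, the following is a minimal, illustrative PyTorch sketch of a denoising block that cross-attends to visual relation embeddings (one token per human-object pair) instead of text embeddings. This is not the authors' implementation; all module names, dimensions, and tensor shapes are assumptions made purely for the example.

# Minimal sketch (not the paper's code): a denoising block conditioned on
# visual relation tokens rather than text embeddings. Names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class RelationConditionedBlock(nn.Module):
    """Cross-attends noisy latent tokens to human-object relation tokens."""

    def __init__(self, latent_dim: int = 256, cond_dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(latent_dim)
        self.norm2 = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(latent_dim, 4 * latent_dim),
            nn.GELU(),
            nn.Linear(4 * latent_dim, latent_dim),
        )

    def forward(self, latent: torch.Tensor, relation_tokens: torch.Tensor) -> torch.Tensor:
        # latent:          (B, N, latent_dim) noisy feature tokens being denoised
        # relation_tokens: (B, P, cond_dim) one embedding per human-object pair
        attended, _ = self.attn(self.norm1(latent), relation_tokens, relation_tokens)
        latent = latent + attended                 # inject relation conditioning
        return latent + self.ff(self.norm2(latent))  # standard feed-forward residual

if __name__ == "__main__":
    block = RelationConditionedBlock()
    noisy = torch.randn(2, 64, 256)       # noisy latent tokens
    relations = torch.randn(2, 5, 256)    # 5 hypothetical human-object pair embeddings
    print(block(noisy, relations).shape)  # torch.Size([2, 64, 256])

In this sketch the relation tokens play the role that text embeddings play in a text-to-image diffusion model; refining those tokens from the denoiser's feedback, as the abstract describes, would sit outside this block.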
Cite
Text
Cao et al. "Visual Relation Diffusion for Human-Object Interaction Detection." International Conference on Computer Vision, 2025.
Markdown
[Cao et al. "Visual Relation Diffusion for Human-Object Interaction Detection." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cao2025iccv-visual/)
BibTeX
@inproceedings{cao2025iccv-visual,
title = {{Visual Relation Diffusion for Human-Object Interaction Detection}},
author = {Cao, Ping and Tang, Yepeng and Zhang, Chunjie and Zheng, Xiaolong and Liang, Chao and Wei, Yunchao and Zhao, Yao},
booktitle = {International Conference on Computer Vision},
year = {2025},
  pages = {23551--23560},
url = {https://mlanthology.org/iccv/2025/cao2025iccv-visual/}
}