Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
Abstract
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which is at odds with the instance-level nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which uses the LLM component of the VLM to provide fine-grained token-level supervision for the HOI detector. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, and consistently achieve superior performance in both the open vocabulary and closed settings. The code will be released on GitHub.
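The abstract does not say how the attention bias from the HOI detector enters the VLM. One common mechanism, assumed here purely for illustration, is to add a per-token bias to the attention logits before the softmax, so that attention concentrates on the image tokens belonging to one human-object pair. The sketch below is not the authors' implementation; `biased_attention`, the toy vectors, and the 0 / large-negative bias values are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive logit bias.

    query:  list[float] of length d
    keys:   list of n key vectors, each of length d
    values: list of n value vectors
    bias:   list of n floats added to the logits (assumed to come from a detector)
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output


# Toy example: four image tokens; the hypothetical detector marks tokens 0 and 1
# as belonging to the human-object pair (bias 0) and suppresses the rest.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
bias = [0.0, 0.0, -1e9, -1e9]

weights, output = biased_attention(query, keys, values, bias)
```

With this bias, the two suppressed tokens receive essentially zero weight, so the output is a mixture of the first two value vectors only. A softer bias (small negative values rather than a hard mask) would down-weight the outside tokens instead of excluding them.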
Cite
Text
Hu et al. "Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection." International Conference on Computer Vision, 2025.
Markdown
[Hu et al. "Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/hu2025iccv-bilateral/)
BibTeX
@inproceedings{hu2025iccv-bilateral,
title = {{Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection}},
author = {Hu, Yupeng and Ding, Changxing and Sun, Chang and Huang, Shaoli and Xu, Xiangmin},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {20126-20136},
url = {https://mlanthology.org/iccv/2025/hu2025iccv-bilateral/}
}