Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions as <human, action, object> triplets. Recent advances in pre-trained vision-language models (VLMs) have improved zero-shot HOI detection, enabling the identification of unseen triplets. However, existing methods leverage the VLM only as an additional encoder for interaction prediction, not for human/object detection. This limitation hinders their ability to detect unseen objects. Furthermore, the additional encoder increases both model size and computational cost. This paper proposes a novel HOI detection framework, ECI-HOI, which unleashes the potential of the pre-trained VLM for zero-shot HOI detection by leveraging it for both sub-tasks. We first employ CLIP as a single image encoder, reducing redundancy in the network architecture. In addition, we propose an instance selector and an HO pair decoder to effectively harmonize human/object detection and interaction prediction in a zero-shot manner. We evaluate our model under various settings on HICO-DET and on our two new test sets: an out-of-distribution image test set and a novel-object test set. Our model outperforms state-of-the-art models while reducing model size by more than 50%, notably achieving a +10.01 mAP improvement under the unseen object setting on HICO-DET. The results on the proposed datasets highlight the zero-shot performance of our model in more challenging settings.

Cite

Text

Yamada et al. "Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Yamada et al. "Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/yamada2025wacv-unleashing/)

BibTeX

@inproceedings{yamada2025wacv-unleashing,
  title     = {{Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection}},
  author    = {Yamada, Moyuru and Dharamshi, Nimish and Kohli, Ayushi and Kasu, Prasad and Khan, Ainulla and Ghulyani, Manu},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {5751-5760},
  url       = {https://mlanthology.org/wacv/2025/yamada2025wacv-unleashing/}
}