CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection

Abstract

Zero-shot Human-Object Interaction (HOI) detection aims to identify both seen and unseen HOI categories. A strong zero-shot HOI detector should not only discriminate novel interactions but also be robust to the positional distribution discrepancy between seen and unseen categories when locating human-object pairs. However, top-performing zero-shot HOI detectors rely on seen and predefined unseen categories to distill knowledge from CLIP, and they locate human-object pairs jointly without considering the potential positional distribution discrepancy, which impairs transferability. In this paper, we introduce CLIP4HOI, a novel framework for zero-shot HOI detection. CLIP4HOI is built on the vision-language model CLIP and addresses the above issues in two ways. First, to prevent the model from overfitting to the joint positional distribution of seen human-object pairs, we tackle zero-shot HOI detection in a disentangled two-stage paradigm. Specifically, humans and objects are identified independently, and all feasible human-object pairs are processed by a Human-Object interactor for pairwise proposal generation. Second, to facilitate better transferability, the CLIP model is adapted into a fine-grained HOI classifier for proposal discrimination, avoiding data-sensitive knowledge distillation. Finally, experiments on prevalent benchmarks show that CLIP4HOI outperforms previous approaches on both rare and unseen categories and sets a series of state-of-the-art records under a variety of zero-shot settings.
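
The two-stage idea in the abstract (detect humans and objects independently, enumerate human-object pairs, then score each pair against text embeddings of interaction categories) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `enumerate_pairs` and `classify_pair`, the detection dictionary layout, and the toy two-dimensional embeddings are all assumptions made for illustration, and plain cosine similarity stands in for the paper's adapted CLIP classifier.

```python
from itertools import product
from math import sqrt


def enumerate_pairs(detections, human_label="person"):
    """Pair every detected human with every other detected instance.

    `detections` is a hypothetical list of dicts with a "label" key.
    Returns (human_index, object_index) tuples; a human is never
    paired with itself, but may act as the object of another human.
    """
    humans = [i for i, d in enumerate(detections) if d["label"] == human_label]
    return [(h, o) for h, o in product(humans, range(len(detections))) if h != o]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_pair(pair_feature, text_embeddings):
    """Score a pair feature against per-category text embeddings.

    `text_embeddings` maps an interaction name to a vector (assumed to
    come from some text encoder). Returns the best name and all scores.
    """
    scores = {name: cosine(pair_feature, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up detections and embeddings.
detections = [{"label": "person"}, {"label": "bicycle"}, {"label": "person"}]
pairs = enumerate_pairs(detections)

text_embeddings = {"ride bicycle": [0.9, 0.1], "wash bicycle": [0.1, 0.9]}
best, scores = classify_pair([1.0, 0.0], text_embeddings)
```

Because the categories enter only as text embeddings at scoring time, an unseen interaction can be added by extending `text_embeddings`, with no change to pair enumeration. That separation is the property the disentangled design relies on in this simplified picture.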

Cite

Text

Mao et al. "CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection." Neural Information Processing Systems, 2023.

Markdown

[Mao et al. "CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/mao2023neurips-clip4hoi/)

BibTeX

@inproceedings{mao2023neurips-clip4hoi,
  title     = {{CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection}},
  author    = {Mao, Yunyao and Deng, Jiajun and Zhou, Wengang and Li, Li and Fang, Yao and Li, Houqiang},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/mao2023neurips-clip4hoi/}
}