HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He

CVPR 2023 pp. 23507-23517

doi:10.1109/CVPR52729.2023.02251

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few-shot and zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder that extracts informative regions in the visual feature map of CLIP via a cross-attention mechanism; the extracted features are then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
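
The two classifier ideas in the abstract can be illustrated with a small sketch. It is not taken from the HOICLIP code base: the embeddings are hand-written placeholders rather than real CLIP outputs, and the function names, the temperature value, and the reading of "visual semantic arithmetic" as an averaged difference between interaction features and object features are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature, class_embeddings, temperature=0.07):
    """Score a visual feature against one embedding per class.

    Each class embedding stands in for the text embedding of an HOI
    description. The temperature is an assumed value, not one from the paper.
    """
    f = l2_normalize(feature)
    logits = []
    for emb in class_embeddings:
        e = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(f, e))
        logits.append(cosine / temperature)
    return softmax(logits)


def verb_representation(hoi_features, object_features):
    """Assumed form of semantic arithmetic: mean of (HOI feature - object feature)."""
    dim = len(hoi_features[0])
    n = len(hoi_features)
    diff_sum = [0.0] * dim
    for h, o in zip(hoi_features, object_features):
        for i in range(dim):
            diff_sum[i] += h[i] - o[i]
    return l2_normalize([d / n for d in diff_sum])


# Hypothetical HOI descriptions with placeholder 3-d "text embeddings".
labels = [
    "a photo of a person riding a bicycle",
    "a photo of a person holding a cup",
    "a photo of a person feeding a horse",
]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

interaction_feature = [0.9, 0.1, 0.0]  # placeholder visual feature
probs = classify(interaction_feature, text_embeddings)
best = labels[probs.index(max(probs))]

# Placeholder training features for a single verb.
verb_vec = verb_representation([[1.0, 1.0], [3.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]])
```

In a real system the placeholder vectors would come from a pretrained vision-language model, and the verb vectors could serve as rows of a second classifier scored the same way as the text embeddings.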

Cite

Text

Ning et al. "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02251

Markdown

[Ning et al. "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/ning2023cvpr-hoiclip/) doi:10.1109/CVPR52729.2023.02251

BibTeX

@inproceedings{ning2023cvpr-hoiclip,
  title     = {{HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models}},
  author    = {Ning, Shan and Qiu, Longtian and Liu, Yongfei and He, Xuming},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {23507--23517},
  doi       = {10.1109/CVPR52729.2023.02251},
  url       = {https://mlanthology.org/cvpr/2023/ning2023cvpr-hoiclip/}
}