LINK: Learning Instance-Level Knowledge from Vision-Language Models for Human-Object Interaction Detection

Wu Eastman, Z Y; Li, Ya-Li; Wang, Yuan; Wang, Shengjin

ICLR 2026

Abstract

Human-Object Interaction (HOI) detection with vision-language models (VLMs) has progressed rapidly, yet a trade-off persists between specialization and generalization. Two major challenges remain: (1) the sparsity of supervision, which hampers effective transfer of foundation models to HOI tasks, and (2) the absence of a generalizable architecture that can excel in both fully supervised and zero-shot scenarios. To address these issues, we propose LINK (Learning INstance-level Knowledge from VLMs). First, we introduce an HOI detection framework equipped with a Human-Object Geometrical Encoder and a VLM Linking Decoder. By decoupling from detector-specific features, our design ensures plug-and-play compatibility with arbitrary object detectors and consistent adaptability across diverse settings. Building on this foundation, we develop a Progressive Learning Strategy under a teacher-student paradigm, which delivers dense supervision over all potential human-object pairs. By contrasting subtle spatial and semantic differences between positive and negative instances, the model learns robust and transferable HOI representations. LINK sets a new state of the art on SWiG-HOI, HICO-DET, and V-COCO across zero-shot, fully supervised, and open-vocabulary settings, with strong generalization to unseen interaction categories.

Cite

Text

Wu Eastman et al. "LINK: Learning Instance-Level Knowledge from Vision-Language Models for Human-Object Interaction Detection." International Conference on Learning Representations, 2026.

Markdown

[Wu Eastman et al. "LINK: Learning Instance-Level Knowledge from Vision-Language Models for Human-Object Interaction Detection." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/y2026iclr-link/)

BibTeX

@inproceedings{y2026iclr-link,
  title     = {{LINK: Learning Instance-Level Knowledge from Vision-Language Models for Human-Object Interaction Detection}},
  author    = {Wu Eastman, Z Y and Li, Ya-Li and Wang, Yuan and Wang, Shengjin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/y2026iclr-link/}
}