Benefit from Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge
Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence, a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
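To make the co-occurrence idea concrete, the following is a minimal Python sketch of the two statistics the abstract alludes to: mining category co-occurrence counts from images annotated with seen categories, and using those counts to score how plausible a candidate label is given objects already detected in an image. This is an illustrative sketch under assumed data structures, not the paper's actual CODet implementation; the names `annotations`, `build_cooccurrence`, and `cooccurrence_score` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(annotations):
    """Count how often each pair of categories appears in the same image.

    `annotations` maps image_id -> set of category names (hypothetical format).
    """
    counts = Counter()
    for cats in annotations.values():
        # Sort so each unordered pair is counted under one canonical key.
        for a, b in combinations(sorted(cats), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_score(candidate, detected, counts):
    """Score a candidate label by how often it co-occurs, in the mined
    statistics, with categories already detected in the current image."""
    score = 0
    for d in detected:
        pair = tuple(sorted((candidate, d)))
        score += counts.get(pair, 0)
    return score


if __name__ == "__main__":
    # Toy seen-category annotations (hypothetical data).
    annotations = {
        "img1": {"person", "surfboard", "sea"},
        "img2": {"person", "surfboard"},
        "img3": {"person", "car"},
    }
    counts = build_cooccurrence(annotations)
    # "surfboard" co-occurs with {"person", "sea"} three times in total,
    # so it gets a higher contextual score than "car" (score 1).
    print(cooccurrence_score("surfboard", {"person", "sea"}, counts))  # 3
    print(cooccurrence_score("car", {"person", "sea"}, counts))       # 1
```

In CODet itself these visual statistics are further aligned with LLM-validated textual dependencies before being injected as pseudo-labels; the sketch above covers only the co-occurrence-counting intuition.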
Cite
Text
Li et al. "Benefit from Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge." International Conference on Computer Vision, 2025.Markdown
[Li et al. "Benefit from Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-benefit/)BibTeX
@inproceedings{li2025iccv-benefit,
title = {{Benefit from Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge}},
author = {Li, Yanqi and Niu, Jianwei and Ren, Tao},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {22110--22119},
url = {https://mlanthology.org/iccv/2025/li2025iccv-benefit/}
}