Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
Abstract
Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies despite being trained on limited categories. However, when the scale of the category vocabulary during training expands to a real-world level, previous classifiers aligned with coarse class names significantly degrade the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as the initialization of alignment classifiers to tackle the failure of object recognition at vast-vocabulary scale. On V3Det, this simple method greatly enhances performance across one-stage, two-stage, and DETR-based detectors with only additional projection layers, in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP, respectively, in the supervised setting of V3Det. In the open-vocabulary setting, Prova achieves a new state of the art with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 AP over previous methods.
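The classifier described in the abstract can be sketched compactly. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the class name PrototypeClassifier, the averaging used to fuse text and image prototypes, and the single linear projection are placeholders for the paper's actual fusion and alignment design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    # Hypothetical sketch: class scores are scaled cosine similarities
    # between projected region features and per-class prototypes built
    # from text and image embeddings (fusion by averaging is an assumption).
    def __init__(self, feat_dim, proto_dim, text_protos, image_protos):
        super().__init__()
        protos = F.normalize((text_protos + image_protos) / 2, dim=-1)
        # Prototypes initialize the classifier weights and remain learnable.
        self.prototypes = nn.Parameter(protos.clone())
        # The only added module: projects detector features into the
        # prototype embedding space.
        self.proj = nn.Linear(feat_dim, proto_dim)
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, region_feats):  # (N, feat_dim) region/query features
        q = F.normalize(self.proj(region_feats), dim=-1)   # (N, proto_dim)
        p = F.normalize(self.prototypes, dim=-1)           # (C, proto_dim)
        return self.logit_scale * q @ p.t()                # (N, C) logits

# Usage with a V3Det-scale vocabulary (~13k classes) and 512-d,
# CLIP-style embeddings; the prototype tensors here are random
# stand-ins for encoded class descriptions and exemplar image crops.
C, D = 13204, 512
clf = PrototypeClassifier(feat_dim=256, proto_dim=D,
                          text_protos=torch.randn(C, D),
                          image_protos=torch.randn(C, D))
logits = clf(torch.randn(100, 256))  # scores for 100 region proposals

The design choice mirrored here is the abstract's claim that prototypes serve only as an initialization: they remain trainable parameters, and the projection layer is the only module added to the detector.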
Cite
Text
Chen et al. "Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I2.32232Markdown
[Chen et al. "Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/chen2025aaai-comprehensive/) doi:10.1609/AAAI.V39I2.32232BibTeX
@inproceedings{chen2025aaai-comprehensive,
title = {{Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection}},
author = {Chen, Yitong and Yao, Wenhao and Meng, Lingchen and Wu, Sihong and Wu, Zuxuan and Jiang, Yu-Gang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {2320-2328},
doi = {10.1609/AAAI.V39I2.32232},
url = {https://mlanthology.org/aaai/2025/chen2025aaai-comprehensive/}
}