Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

Abstract

Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of categories despite being trained on limited vocabularies. However, when the category vocabulary during training expands to a real-world scale, previous classifiers aligned only with coarse class names significantly degrade the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as the initialization of alignment classifiers to tackle recognition failures in vast-vocabulary object detection. On V3Det, this simple method greatly enhances performance across one-stage, two-stage, and DETR-based detectors, requiring only additional projection layers, in both the supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP respectively in the supervised setting of V3Det. In the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 AP over previous methods.
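
To make the idea concrete, below is a minimal sketch of a prototype-based alignment classifier of the kind the abstract describes, assuming PyTorch and pre-extracted text and visual prototypes (e.g., CLIP text embeddings of class names and averaged embeddings of class exemplar images). The class name PrototypeClassifier, the fusion by simple averaging, and the temperature value are illustrative assumptions, not Prova's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    """Hypothetical prototype-based alignment classifier (a sketch,
    not the paper's exact design). Classifier weights are initialized
    from fused multi-modal prototypes; the only new trainable part is
    a projection layer on the detector's region features."""

    def __init__(self, feat_dim: int, proto_dim: int,
                 text_protos: torch.Tensor,    # (num_classes, proto_dim)
                 visual_protos: torch.Tensor,  # (num_classes, proto_dim)
                 temperature: float = 0.01):
        super().__init__()
        # Fuse the two modalities by averaging (one plausible choice),
        # then L2-normalize so logits are cosine similarities.
        protos = F.normalize(text_protos + visual_protos, dim=-1)
        # Prototypes serve as frozen classifier weights.
        self.prototypes = nn.Parameter(protos, requires_grad=False)
        # Lightweight projection from detector feature space into the
        # prototype embedding space.
        self.proj = nn.Linear(feat_dim, proto_dim)
        self.temperature = temperature

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, feat_dim) region features from any
        # one-stage, two-stage, or DETR-based detector head.
        x = F.normalize(self.proj(roi_feats), dim=-1)
        # Scaled cosine-similarity logits against all class prototypes.
        return x @ self.prototypes.t() / self.temperature

Usage with random stand-ins for the prototypes (real prototypes would come from a vision-language model):

text_p = F.normalize(torch.randn(100, 512), dim=-1)
vis_p = F.normalize(torch.randn(100, 512), dim=-1)
clf = PrototypeClassifier(feat_dim=1024, proto_dim=512,
                          text_protos=text_p, visual_protos=vis_p)
logits = clf(torch.randn(8, 1024))  # (8, 100) per-class logits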

Cite

Text

Chen et al. "Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I2.32232

Markdown

[Chen et al. "Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/chen2025aaai-comprehensive/) doi:10.1609/AAAI.V39I2.32232

BibTeX

@inproceedings{chen2025aaai-comprehensive,
  title     = {{Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection}},
  author    = {Chen, Yitong and Yao, Wenhao and Meng, Lingchen and Wu, Sihong and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2320--2328},
  doi       = {10.1609/AAAI.V39I2.32232},
  url       = {https://mlanthology.org/aaai/2025/chen2025aaai-comprehensive/}
}