Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Abstract
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN by re-purposing 14 existing datasets with all labels grounded in a single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study of state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find that existing pre-trained models have distinct strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
Cite
Text
Hu et al. "Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01108
Markdown
[Hu et al. "Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/hu2023iccv-opendomain/) doi:10.1109/ICCV51070.2023.01108
BibTeX
@inproceedings{hu2023iccv-opendomain,
title = {{Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities}},
author = {Hu, Hexiang and Luan, Yi and Chen, Yang and Khandelwal, Urvashi and Joshi, Mandar and Lee, Kenton and Toutanova, Kristina and Chang, Ming-Wei},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {12065-12075},
doi = {10.1109/ICCV51070.2023.01108},
url = {https://mlanthology.org/iccv/2023/hu2023iccv-opendomain/}
}