Taxonomy-Aware Evaluation of Vision-Language Models
Abstract
When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer," rather than the specific label "Norway spruce". This raises two issues for evaluation: Firstly, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., "conifer"). Secondly, a useful classification measure should give partial credit to less specific, but not incorrect, answers ("Norway spruce" being a type of "conifer"). To meet these requirements, we propose a framework for evaluating unconstrained text predictions such as those generated from a vision-language model against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
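The partial-credit idea in the abstract can be illustrated with the common ancestor-set formulation of hierarchical precision and recall. The sketch below is illustrative only: the toy taxonomy `PARENT`, the helper names, and the choice to count the root node are assumptions made for this example, and the paper's exact definitions may differ.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "Norway spruce": "spruce",
    "spruce": "conifer",
    "pine": "conifer",
    "conifer": "tree",
    "oak": "tree",
    "tree": "plant",
}


def ancestors(node):
    """Return the node together with all of its ancestors up to the root."""
    path = {node}
    while node in PARENT:
        node = PARENT[node]
        path.add(node)
    return path


def hierarchical_pr(predicted, target):
    """Ancestor-set hierarchical precision and recall (assumed formulation).

    precision = |anc(pred) & anc(target)| / |anc(pred)|
    recall    = |anc(pred) & anc(target)| / |anc(target)|
    """
    pred_set = ancestors(predicted)
    true_set = ancestors(target)
    overlap = len(pred_set & true_set)
    return overlap / len(pred_set), overlap / len(true_set)


if __name__ == "__main__":
    # A correct but less specific answer: full precision, reduced recall.
    print(hierarchical_pr("conifer", "Norway spruce"))
    # An incorrect answer that still shares upper levels of the taxonomy.
    print(hierarchical_pr("oak", "Norway spruce"))
```

Under these assumptions, answering "conifer" for a "Norway spruce" image keeps precision at 1 because every node on the predicted path is correct, while recall drops because the two most specific levels are missing. A wrong answer such as "oak" loses on both measures.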
Cite
Text
Snæbjarnarson et al. "Taxonomy-Aware Evaluation of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00851
Markdown
[Snæbjarnarson et al. "Taxonomy-Aware Evaluation of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/snbjarnarson2025cvpr-taxonomyaware/) doi:10.1109/CVPR52734.2025.00851
BibTeX
@inproceedings{snbjarnarson2025cvpr-taxonomyaware,
title = {{Taxonomy-Aware Evaluation of Vision-Language Models}},
author = {Snæbjarnarson, Vésteinn and Du, Kevin and Stoehr, Niklas and Belongie, Serge and Cotterell, Ryan and Lang, Nico and Frank, Stella},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {9109-9120},
doi = {10.1109/CVPR52734.2025.00851},
url = {https://mlanthology.org/cvpr/2025/snbjarnarson2025cvpr-taxonomyaware/}
}