Taxonomy-Aware Evaluation of Vision-Language Models

Abstract

When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer" rather than the specific label "Norway spruce". This raises two issues for evaluation. First, the unconstrained generated text needs to be mapped to the evaluation label space (e.g., to "conifer"). Second, a useful classification measure should give partial credit to less specific, but not incorrect, answers ("Norway spruce" being a type of "conifer"). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated by a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
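To make the idea of partial credit concrete, the following is a minimal sketch of hierarchical precision and recall on a toy taxonomy. It assumes the common set-based formulation, in which each label is expanded to itself plus its ancestors (root excluded) and the two sets are compared; the exact definition used in the paper may differ. The taxonomy, the `PARENT` map, and the function names are hypothetical and chosen only for illustration.

```python
# Toy taxonomy as child -> parent links (hypothetical, for illustration only).
# "tree" is the root and has no parent.
PARENT = {
    "tree": None,
    "conifer": "tree",
    "broadleaf": "tree",
    "spruce": "conifer",
    "Norway spruce": "spruce",
    "oak": "broadleaf",
}


def ancestors_inclusive(node):
    """Return the node together with all of its ancestors, excluding the root."""
    path = set()
    while PARENT[node] is not None:
        path.add(node)
        node = PARENT[node]
    return path


def hierarchical_pr(predicted, target):
    """Hierarchical precision and recall of a predicted node against a target node.

    Assumed formulation: precision = |P ∩ T| / |P| and recall = |P ∩ T| / |T|,
    where P and T are the ancestor-inclusive sets of the prediction and target.
    Predicting the root yields an empty P; we score that as 0.0 by convention.
    """
    pred = ancestors_inclusive(predicted)
    true = ancestors_inclusive(target)
    overlap = len(pred & true)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(true) if true else 0.0
    return precision, recall


# A less specific but correct answer: fully precise, only partly complete.
print(hierarchical_pr("conifer", "Norway spruce"))
# An answer in the wrong branch of the taxonomy receives no credit.
print(hierarchical_pr("oak", "Norway spruce"))
```

Under these assumptions, answering "conifer" for a "Norway spruce" image keeps precision at 1 (nothing predicted is wrong) while recall drops to 1/3 (two levels of specificity are missing), whereas "oak" scores 0 on both.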

Cite

Text

Snæbjarnarson et al. "Taxonomy-Aware Evaluation of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00851

Markdown

[Snæbjarnarson et al. "Taxonomy-Aware Evaluation of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/snbjarnarson2025cvpr-taxonomyaware/) doi:10.1109/CVPR52734.2025.00851

BibTeX

@inproceedings{snbjarnarson2025cvpr-taxonomyaware,
  title     = {{Taxonomy-Aware Evaluation of Vision-Language Models}},
  author    = {Snæbjarnarson, Vésteinn and Du, Kevin and Stoehr, Niklas and Belongie, Serge and Cotterell, Ryan and Lang, Nico and Frank, Stella},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {9109-9120},
  doi       = {10.1109/CVPR52734.2025.00851},
  url       = {https://mlanthology.org/cvpr/2025/snbjarnarson2025cvpr-taxonomyaware/}
}