Towards an Exhaustive Evaluation of Vision-Language Foundation Models

Abstract

Vision-language foundation models have improved considerably in performance over the last few years. However, comprehensive evaluation methods able to clearly explain their performance are still lacking. We argue that a more systematic approach to foundation model evaluation would benefit their use in real-world applications. In particular, we think that these models should be evaluated on a broad range of precise capabilities, in order to raise awareness of the breadth of their scope and of their potential weaknesses. To that end, we propose a methodology to build a taxonomy of multimodal capabilities for vision-language foundation models. The proposed taxonomy is intended as a first step towards an exhaustive evaluation of vision-language foundation models.

Cite

Text

Salin et al. "Towards an Exhaustive Evaluation of Vision-Language Foundation Models." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00041

Markdown

[Salin et al. "Towards an Exhaustive Evaluation of Vision-Language Foundation Models." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/salin2023iccvw-exhaustive/) doi:10.1109/ICCVW60793.2023.00041

BibTeX

@inproceedings{salin2023iccvw-exhaustive,
  title     = {{Towards an Exhaustive Evaluation of Vision-Language Foundation Models}},
  author    = {Salin, Emmanuelle and Ayache, Stéphane and Favre, Benoît},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {339--352},
  doi       = {10.1109/ICCVW60793.2023.00041},
  url       = {https://mlanthology.org/iccvw/2023/salin2023iccvw-exhaustive/}
}