Towards an Exhaustive Evaluation of Vision-Language Foundation Models
Abstract
Vision-language foundation models have improved considerably in performance over the last few years. However, comprehensive evaluation methods that can clearly explain their performance are still lacking. We argue that a more systematic approach to foundation model evaluation would benefit their use in real-world applications. In particular, we think that these models should be evaluated on a broad range of precise capabilities, in order to raise awareness of the breadth of their scope and of their potential weaknesses. To that end, we propose a methodology to build a taxonomy of multimodal capabilities for vision-language foundation models. The proposed taxonomy is intended as a first step towards an exhaustive evaluation of vision-language foundation models.
Cite
Text
Salin et al. "Towards an Exhaustive Evaluation of Vision-Language Foundation Models." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00041
Markdown
[Salin et al. "Towards an Exhaustive Evaluation of Vision-Language Foundation Models." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/salin2023iccvw-exhaustive/) doi:10.1109/ICCVW60793.2023.00041
BibTeX
@inproceedings{salin2023iccvw-exhaustive,
title = {{Towards an Exhaustive Evaluation of Vision-Language Foundation Models}},
author = {Salin, Emmanuelle and Ayache, Stéphane and Favre, Benoît},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2023},
pages = {339--352},
doi = {10.1109/ICCVW60793.2023.00041},
url = {https://mlanthology.org/iccvw/2023/salin2023iccvw-exhaustive/}
}