FLAVA: A Foundational Language and Vision Alignment Model

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once---a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
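The abstract's key architectural point is that one model should support both a cross-modal (contrastive) path over unimodal embeddings and a multi-modal (early fusion) path over both modalities. The following is a minimal, hypothetical PyTorch sketch of that pattern only, not FLAVA's actual implementation; names such as ToyFlavaLikeModel, patch_proj, and fusion_encoder are illustrative assumptions.

```python
# Hypothetical sketch (not FLAVA's code): unimodal image and text encoders whose
# pooled embeddings are aligned with a contrastive loss (cross-modal path), plus a
# fusion transformer over the concatenated unimodal states (multi-modal path).
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(dim: int, depth: int, heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class ToyFlavaLikeModel(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, patch_dim=768, heads=4):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)      # image patch features -> model dim
        self.token_emb = nn.Embedding(vocab_size, dim)   # text token ids -> model dim
        self.image_encoder = make_encoder(dim, depth=2, heads=heads)
        self.text_encoder = make_encoder(dim, depth=2, heads=heads)
        self.fusion_encoder = make_encoder(dim, depth=2, heads=heads)  # early fusion over both modalities
        self.logit_scale = nn.Parameter(torch.tensor(2.659))           # learnable log temperature

    def forward(self, patches: torch.Tensor, tokens: torch.Tensor):
        img_states = self.image_encoder(self.patch_proj(patches))  # [B, Np, dim]
        txt_states = self.text_encoder(self.token_emb(tokens))     # [B, Nt, dim]

        # Cross-modal (contrastive) path: mean-pooled, L2-normalized unimodal embeddings.
        img_emb = F.normalize(img_states.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt_states.mean(dim=1), dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()    # [B, B] similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        contrastive_loss = (F.cross_entropy(logits, targets) +
                            F.cross_entropy(logits.t(), targets)) / 2

        # Multi-modal (fusion) path: one transformer over the concatenated states,
        # usable for downstream multi-modal tasks.
        fused = self.fusion_encoder(torch.cat([img_states, txt_states], dim=1))
        return contrastive_loss, fused


if __name__ == "__main__":
    model = ToyFlavaLikeModel()
    patches = torch.randn(4, 16, 768)            # 4 images, 16 patch features each
    tokens = torch.randint(0, 30522, (4, 12))    # 4 captions, 12 tokens each
    loss, fused = model(patches, tokens)
    print(loss.item(), fused.shape)              # scalar loss, torch.Size([4, 28, 256])
```

This sketch keeps only the two paths named in the abstract; the full model additionally relies on unimodal and multimodal pretraining objectives that are out of scope here.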

Cite

Text

Singh et al. "FLAVA: A Foundational Language and Vision Alignment Model." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01519

Markdown

[Singh et al. "FLAVA: A Foundational Language and Vision Alignment Model." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/singh2022cvpr-flava/) doi:10.1109/CVPR52688.2022.01519

BibTeX

@inproceedings{singh2022cvpr-flava,
  title     = {{FLAVA: A Foundational Language and Vision Alignment Model}},
  author    = {Singh, Amanpreet and Hu, Ronghang and Goswami, Vedanuj and Couairon, Guillaume and Galuba, Wojciech and Rohrbach, Marcus and Kiela, Douwe},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {15638--15650},
  doi       = {10.1109/CVPR52688.2022.01519},
  url       = {https://mlanthology.org/cvpr/2022/singh2022cvpr-flava/}
}