PaLI: A Jointly-Scaled Multilingual Language-Image Model
Abstract
Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI, a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
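To make the interface the abstract describes more concrete (text generated from combined visual and textual inputs), here is a minimal, illustrative PyTorch sketch. It is not the paper's implementation: `ToyViT` and `ToyPaLI` are hypothetical names, all dimensions are toy-sized, and the generic `nn.Transformer` merely stands in for the large pretrained encoder-decoder language model and ViT the paper pairs. The sketch shows the core idea: patch embeddings from the image encoder are concatenated with embedded prompt tokens and fed to a seq2seq model that produces output-text logits.

```python
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """Toy patch encoder standing in for the paper's ViT / ViT-e."""
    def __init__(self, d_model=128, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, images):                     # (B, 3, H, W)
        x = self.proj(images)                      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)        # (B, num_patches, D)

class ToyPaLI(nn.Module):
    """Toy stand-in for PaLI's image-and-text-in, text-out interface."""
    def __init__(self, vocab=1000, d_model=128):
        super().__init__()
        self.vit = ToyViT(d_model)
        self.embed = nn.Embedding(vocab, d_model)
        # Generic seq2seq Transformer standing in for the pretrained
        # encoder-decoder language model used in the paper.
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, images, prompt_ids, target_ids):
        visual = self.vit(images)                  # visual "tokens"
        textual = self.embed(prompt_ids)           # embedded text prompt
        src = torch.cat([visual, textual], dim=1)  # joint multimodal input
        out = self.seq2seq(src, self.embed(target_ids))
        return self.lm_head(out)                   # next-token logits

model = ToyPaLI()
images = torch.randn(2, 3, 64, 64)
prompt = torch.randint(0, 1000, (2, 8))   # e.g. a task prompt in any language
target = torch.randint(0, 1000, (2, 5))   # decoder input for teacher forcing
print(model(images, prompt, target).shape)  # torch.Size([2, 5, 1000])
```

Because every task is cast as this same text-in/text-out generation problem, captioning, VQA, and multilingual variants differ only in the prompt and target strings, which is what the abstract means by a flexible task interface.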
Cite
Text
Chen et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." International Conference on Learning Representations, 2023.
Markdown
[Chen et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/chen2023iclr-pali/)
BibTeX
@inproceedings{chen2023iclr-pali,
title = {{PaLI: A Jointly-Scaled Multilingual Language-Image Model}},
author = {Chen, Xi and Wang, Xiao and Changpinyo, Soravit and Piergiovanni, AJ and Padlewski, Piotr and Salz, Daniel and Goodman, Sebastian and Grycner, Adam and Mustafa, Basil and Beyer, Lucas and Kolesnikov, Alexander and Puigcerver, Joan and Ding, Nan and Rong, Keran and Akbari, Hassan and Mishra, Gaurav and Xue, Linting and Thapliyal, Ashish V. and Bradbury, James and Kuo, Weicheng and Seyedhosseini, Mojtaba and Jia, Chao and Ayan, Burcu Karagol and Ruiz, Carlos Riquelme and Steiner, Andreas Peter and Angelova, Anelia and Zhai, Xiaohua and Houlsby, Neil and Soricut, Radu},
booktitle = {International Conference on Learning Representations},
year = {2023},
url = {https://mlanthology.org/iclr/2023/chen2023iclr-pali/}
}