Language Is Not All You Need: Aligning Perception with Language Models

Abstract

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal settings and from multimodal settings to language. In addition, we introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.

Cite

Text

Huang et al. "Language Is Not All You Need: Aligning Perception with Language Models." Neural Information Processing Systems, 2023.

Markdown

[Huang et al. "Language Is Not All You Need: Aligning Perception with Language Models." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/huang2023neurips-language/)

BibTeX

@inproceedings{huang2023neurips-language,
  title     = {{Language Is Not All You Need: Aligning Perception with Language Models}},
  author    = {Huang, Shaohan and Dong, Li and Wang, Wenhui and Hao, Yaru and Singhal, Saksham and Ma, Shuming and Lv, Tengchao and Cui, Lei and Mohammed, Owais Khan and Patra, Barun and Liu, Qiang and Aggarwal, Kriti and Chi, Zewen and Bjorck, Nils and Chaudhary, Vishrav and Som, Subhojit and Song, Xia and Wei, Furu},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/huang2023neurips-language/}
}