CogVLM: Visual Expert for Pretrained Language Models
Abstract
We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular \emph{shallow alignment} approach, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision-language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 17 classic cross-modal benchmarks, including 1) image captioning datasets: NoCaps, Flickr30k; 2) VQA datasets: OKVQA, TextVQA, OCRVQA, ScienceQA; 3) LVLM benchmarks: MM-Vet, MMBench, SEED-Bench, LLaVABench, POPE, MMMU, MathVista; 4) visual grounding datasets: RefCOCO, RefCOCO+, RefCOCOg, Visual7W. Code and checkpoints are available on GitHub.
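To make the visual expert idea concrete, the sketch below (not the official implementation) shows one plausible reading of the abstract: image-token positions are routed through trainable copies of the attention projections, while text-token positions keep the frozen language-model projections. The class name, the `vision_mask` argument, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualExpertAttention(nn.Module):
    """Minimal sketch of a visual-expert attention layer, assuming image and
    text tokens share one sequence but use different projection weights."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Frozen projections from the pretrained language model.
        self.text_qkv = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.text_out = nn.Linear(hidden_size, hidden_size, bias=False)
        for p in list(self.text_qkv.parameters()) + list(self.text_out.parameters()):
            p.requires_grad = False
        # Trainable visual-expert projections with the same shapes.
        self.vision_qkv = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.vision_out = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, vision_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); vision_mask: (batch, seq) bool,
        # True where the token comes from the image encoder.
        mask = vision_mask.unsqueeze(-1)
        qkv = torch.where(mask, self.vision_qkv(hidden_states), self.text_qkv(hidden_states))
        b, s, _ = hidden_states.shape
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        # Attention itself is shared; only the projections are token-type specific.
        attn = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, -1)
        return torch.where(mask, self.vision_out(attn), self.text_out(attn))


# Hypothetical usage: first 4 positions are image tokens, the rest are text.
layer = VisualExpertAttention(hidden_size=64, num_heads=8)
x = torch.randn(2, 10, 64)
vision_mask = torch.zeros(2, 10, dtype=torch.bool)
vision_mask[:, :4] = True
out = layer(x, vision_mask)  # (2, 10, 64)
```

An analogous per-token routing could be applied to the FFN layers; only the vision branch is updated during training, which is one way to read the claim that NLP performance of the frozen language model is preserved.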
Cite
Text
Wang et al. "CogVLM: Visual Expert for Pretrained Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-3860Markdown
[Wang et al. "CogVLM: Visual Expert for Pretrained Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wang2024neurips-cogvlm/) doi:10.52202/079017-3860BibTeX
@inproceedings{wang2024neurips-cogvlm,
  title     = {{CogVLM: Visual Expert for Pretrained Language Models}},
  author    = {Wang, Weihan and Lv, Qingsong and Yu, Wenmeng and Hong, Wenyi and Qi, Ji and Wang, Yan and Ji, Junhui and Yang, Zhuoyi and Zhao, Lei and Song, Xixuan and Xu, Jiazheng and Chen, Keqin and Xu, Bin and Li, Juanzi and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3860},
  url       = {https://mlanthology.org/neurips/2024/wang2024neurips-cogvlm/}
}