Visual Perception by Large Language Model’s Weights

Ma, Feipeng; Xue, Hongwei; Zhou, Yizhou; Wang, Guangting; Rao, Fengyun; Yan, Shilin; Zhang, Yueyi; Wu, Siying; Shou, Mike Zheng; Sun, Xiaoyan

doi:10.52202/079017-0898

NeurIPS 2024

Abstract

Existing Multimodal Large Language Models (MLLMs) perceive visual information by aligning visual features with the input space of Large Language Models (LLMs) and concatenating visual tokens with text tokens to form a unified input sequence. These methods show promising results on various vision-language tasks, but the visual tokens lengthen the input sequence and make computation expensive. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert the features into perceptual weights, and merge the perceptual weights with the LLM's weights. The LLM input therefore requires no visual tokens, which shortens the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with a perceptual weights generator, which converts visual features into perceptual weights with a low-rank property, in a form similar to LoRA. Experimental results show that VLoRA achieves comparable performance on various MLLM benchmarks while significantly reducing the computational cost of both training and inference. Code and models are released at https://github.com/FeipengMa6/VLoRA.
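The parameter space alignment idea in the abstract can be illustrated with a toy NumPy sketch. This is not the released VLoRA code: the mean pooling, the two projection matrices `proj_a` and `proj_b`, the function names, and all dimensions are assumptions made purely for illustration. It shows only the general shape of the pipeline: visual features are mapped to two low-rank factors, their product is added to a frozen weight matrix in the style of LoRA, and the layer then processes a text-only sequence.

```python
import numpy as np

def generate_perceptual_weights(visual_feats, proj_a, proj_b, rank):
    """Toy stand-in for a perceptual weights generator (hypothetical design).

    visual_feats: (num_patches, d_vis) features from a vision encoder
    proj_a:       (d_vis, d_out * rank) assumed learned projection
    proj_b:       (d_vis, rank * d_in)  assumed learned projection
    Returns low-rank factors a (d_out, rank) and b (rank, d_in).
    """
    pooled = visual_feats.mean(axis=0)        # (d_vis,) simple pooling, an assumption
    a = (pooled @ proj_a).reshape(-1, rank)   # (d_out, rank)
    b = (pooled @ proj_b).reshape(rank, -1)   # (rank, d_in)
    return a, b

def merge_weights(w, a, b):
    """Merge the image-dependent low-rank update into a weight matrix."""
    return w + a @ b

rng = np.random.default_rng(0)
d_vis, d_in, d_out, rank, num_patches = 16, 32, 32, 4, 9

w = rng.standard_normal((d_out, d_in))                  # frozen LLM weight (toy)
visual_feats = rng.standard_normal((num_patches, d_vis))
proj_a = rng.standard_normal((d_vis, d_out * rank)) * 0.01
proj_b = rng.standard_normal((d_vis, rank * d_in)) * 0.01

a, b = generate_perceptual_weights(visual_feats, proj_a, proj_b, rank)
w_merged = merge_weights(w, a, b)

# The input sequence contains text tokens only; no visual tokens are appended.
text_tokens = rng.standard_normal((5, d_in))
out = text_tokens @ w_merged.T                          # (5, d_out)
print(out.shape)
```

In this sketch the sequence length seen by the layer stays at the number of text tokens regardless of image resolution, which is the source of the efficiency gain the abstract describes; the image influences the output only through the rank-limited update `a @ b`.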

Cite

Text

Ma et al. "Visual Perception by Large Language Model’s Weights." Neural Information Processing Systems, 2024. doi:10.52202/079017-0898

Markdown

[Ma et al. "Visual Perception by Large Language Model’s Weights." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/ma2024neurips-visual/) doi:10.52202/079017-0898

BibTeX

@inproceedings{ma2024neurips-visual,
  title     = {{Visual Perception by Large Language Model’s Weights}},
  author    = {Ma, Feipeng and Xue, Hongwei and Zhou, Yizhou and Wang, Guangting and Rao, Fengyun and Yan, Shilin and Zhang, Yueyi and Wu, Siying and Shou, Mike Zheng and Sun, Xiaoyan},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0898},
  url       = {https://mlanthology.org/neurips/2024/ma2024neurips-visual/}
}