SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-Modal Large Language Models

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings and image scales. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight-mixing strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularities, providing language models with more robust image representations. We further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications, with highlighted fine-grained visual recognition abilities such as region-level understanding, caption grounding, document layout detection, and human pose estimation. We hope our work may shed light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
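The weight-mixing idea described above can be sketched as a simple parameter-space interpolation between two LLMs fine-tuned on different domains. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `mix_weights`, the blending ratio `beta`, and the toy state dicts are all assumptions made for illustration.

```python
def mix_weights(real_state, synth_state, beta=0.5):
    """Hypothetical sketch: blend two models' parameters key-by-key.

    Returns beta * real + (1 - beta) * synth for every parameter.
    The two state dicts must share the same architecture (same keys).
    """
    assert real_state.keys() == synth_state.keys(), "architectures must match"
    return {
        name: beta * real_state[name] + (1.0 - beta) * synth_state[name]
        for name in real_state
    }


# Toy usage: plain floats stand in for weight tensors of the two domain LLMs.
real = {"layer0.weight": 1.0, "layer0.bias": 0.0}
synth = {"layer0.weight": 3.0, "layer0.bias": 2.0}
mixed = mix_weights(real, synth, beta=0.5)
```

In practice the same elementwise blend would be applied to tensor-valued state dicts; interpolating directly in parameter space is what lets the mixed model inherit semantics from both domains without extra training.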

Cite

Text

Lin et al. "SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-Modal Large Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73033-7_3

Markdown

[Lin et al. "SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-Modal Large Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/lin2024eccv-sphinx/) doi:10.1007/978-3-031-73033-7_3

BibTeX

@inproceedings{lin2024eccv-sphinx,
  title     = {{SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-Modal Large Language Models}},
  author    = {Lin, Ziyi and Liu, Dongyang and Zhang, Renrui and Gao, Peng and Qiu, Longtian and Xiao, Han and Qiu, Han and Shao, Wenqi and Chen, Keqin and Han, Jiaming and Huang, Siyuan and Zhang, Yichi and He, Xuming and Qiao, Yu and Li, Hongsheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73033-7_3},
  url       = {https://mlanthology.org/eccv/2024/lin2024eccv-sphinx/}
}