Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Abstract

In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, most MLLMs are built on the well-known Transformer network, which incurs less efficient quadratic computational complexity. In this study, we introduce Cobra, a multi-modal large language model built upon a state-space model, which has demonstrated significant potential for efficiently handling long sequences, with fast inference and linear scalability in sequence length. Specifically, Cobra replaces the Transformer-based backbone (e.g., LLaMA or Phi) with a pre-trained Mamba language model. We then empirically explore effective strategies for aligning the visual and textual modalities and for integrating various pre-trained Mamba variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra runs 3×–4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2, and its linear sequential modeling further enhances performance. (ii) Cobra fine-tunes only a small fraction of its parameters (∼48% of model parameters), yet achieves a significant improvement in overall performance compared to LLaVA.
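The efficiency claim above rests on the state-space model's linear-time recurrence: each token updates a fixed-size hidden state once, so a length-L sequence costs O(L), versus the O(L²) pairwise interactions of self-attention. Below is a minimal didactic sketch of such a recurrence, a toy scalar SSM with fixed coefficients `a`, `b`, `c` (all hypothetical). It is not the actual Cobra/Mamba implementation, which uses input-dependent (selective) parameters, multi-channel states, and a hardware-aware parallel scan.

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Run the linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Each step touches the fixed-size state exactly once, so the whole
    scan costs O(L) for a length-L input sequence.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update: linear in sequence length
        ys.append(c * h)   # readout for this time step
    return ys


# An impulse input decays geometrically through the state:
print(ssm_scan([1.0, 0.0, 0.0]))  # -> [0.5, 0.45, 0.405]
```

During autoregressive inference this is why Mamba-style backbones are fast: generating the next token requires only the current state, not re-attending over the entire prefix.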

Cite

Text

Zhao et al. "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33131

Markdown

[Zhao et al. "Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhao2025aaai-cobra/) doi:10.1609/AAAI.V39I10.33131

BibTeX

@inproceedings{zhao2025aaai-cobra,
  title     = {{Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference}},
  author    = {Zhao, Han and Zhang, Min and Zhao, Wei and Ding, Pengxiang and Huang, Siteng and Wang, Donglin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10421--10429},
  doi       = {10.1609/AAAI.V39I10.33131},
  url       = {https://mlanthology.org/aaai/2025/zhao2025aaai-cobra/}
}