Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications

Abstract

Multimodal Large Language Models (MLLMs) are growing rapidly, and a large body of new work has appeared recently. The prevailing trend is to adopt data-driven methodologies, in which diverse instruction-following datasets are collected. However, these approaches face the challenge of limited visual perception, as they rely solely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still suffer from information loss, because textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments demonstrate its effectiveness in advancing MLLMs, showing improved visual perception achieved through the integration of visual experts.
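
As a rough illustration of the mixture-of-experts idea described in the abstract, the sketch below combines feature vectors from several visual experts using softmax gate weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`softmax`, `fuse_expert_features`), the toy feature vectors, and the choice of soft gating over a weighted sum are all hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_features(expert_features, gate_scores):
    """Return the gate-weighted sum of per-expert feature vectors.

    expert_features: list of equal-length float lists, one per expert
                     (hypothetical stand-ins for encoder or tool outputs).
    gate_scores:     one unnormalized score per expert (assumed to come
                     from some learned gating network, not shown here).
    """
    if len(expert_features) != len(gate_scores):
        raise ValueError("one gate score per expert is required")
    dim = len(expert_features[0])
    if any(len(f) != dim for f in expert_features):
        raise ValueError("all expert features must share one dimension")

    weights = softmax(gate_scores)
    fused = [0.0] * dim
    for w, feat in zip(weights, expert_features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused, weights


# Toy, made-up features for three hypothetical experts.
clip_like = [1.0, 0.0, 0.0]
detection = [0.0, 1.0, 0.0]
ocr = [0.0, 0.0, 1.0]

# Equal gate scores give each expert the same weight.
fused, weights = fuse_expert_features([clip_like, detection, ocr], [0.0, 0.0, 0.0])
```

In this sketch the fused vector would stand in for the enriched visual summary passed on to the language model; how the actual method routes or aggregates expert outputs is not specified in the abstract.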

Cite

Text

Liu et al. "Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/123

Markdown

[Liu et al. "Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/liu2024ijcai-elegance/) doi:10.24963/ijcai.2024/123

BibTeX

@inproceedings{liu2024ijcai-elegance,
  title     = {{Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications}},
  author    = {Liu, Jinyuan and Wu, Guanyao and Liu, Zhu and Ma, Long and Liu, Risheng and Fan, Xin},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {1110--1118},
  doi       = {10.24963/ijcai.2024/123},
  url       = {https://mlanthology.org/ijcai/2024/liu2024ijcai-elegance/}
}