Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications
Abstract
Multimodal Large Language Models (MLLMs) are growing rapidly, and a large number of new works have appeared recently. The prevailing trend is to adopt data-driven methodologies, in which diverse instruction-following datasets are collected. However, these approaches consistently face the challenge of limited visual perception capabilities, as they rely solely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still suffer from information loss, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments evaluate its effectiveness in advancing MLLMs, showing that integrating visual experts improves visual perception capability.
Cite
Text
Liu et al. "Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/123
Markdown
[Liu et al. "Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/liu2024ijcai-elegance/) doi:10.24963/ijcai.2024/123
BibTeX
@inproceedings{liu2024ijcai-elegance,
title = {{Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-Modality Image Fusion and Applications}},
author = {Liu, Jinyuan and Wu, Guanyao and Liu, Zhu and Ma, Long and Liu, Risheng and Fan, Xin},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2024},
pages = {1110--1118},
doi = {10.24963/ijcai.2024/123},
url = {https://mlanthology.org/ijcai/2024/liu2024ijcai-elegance/}
}