Long-Context Vision Large Language Models: Empirical Insights and a Baseline
Abstract
The development of long-context large language models (LLMs) has attracted significant interest. However, progress on long-context vision large language models (VLLMs) lags behind, despite their vast potential in applications such as high-resolution input, multimodal in-context learning, multi-image understanding, and video understanding. In this paper, we present an empirical study that identifies the major challenges in developing long-context VLLMs, and we propose a simple yet effective baseline for long-context tasks. By captioning the images separately and aggregating the captions as input, the baseline directly alleviates the input-length issue, and we show that it outperforms other context-extension and token-reduction strategies.
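To make the caption-and-aggregate baseline concrete, here is a minimal sketch using off-the-shelf Hugging Face pipelines. The specific model names, prompt wording, and helper function are illustrative assumptions, not the paper's exact setup; the idea is only that each image is captioned independently and the language model reasons over the concatenated captions instead of over all visual tokens at once.

```python
from transformers import pipeline

# Illustrative model choices; any image captioner and instruction-tuned LLM would do.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def answer_over_many_images(image_paths, question):
    # 1) Caption each image independently, so no single forward pass
    #    needs to fit all visual tokens into the context window.
    captions = [captioner(path)[0]["generated_text"] for path in image_paths]

    # 2) Aggregate the captions into one textual context.
    context = "\n".join(f"Image {i + 1}: {c}" for i, c in enumerate(captions))

    # 3) Query the language model over the aggregated captions.
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt, max_new_tokens=64)[0]["generated_text"]
```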
Cite
Text
Zong et al. "Long-Context Vision Large Language Models: Empirical Insights and a Baseline." ICML 2024 Workshops: LCFM, 2024.
Markdown
[Zong et al. "Long-Context Vision Large Language Models: Empirical Insights and a Baseline." ICML 2024 Workshops: LCFM, 2024.](https://mlanthology.org/icmlw/2024/zong2024icmlw-longcontext/)
BibTeX
@inproceedings{zong2024icmlw-longcontext,
title = {{Long-Context Vision Large Language Models: Empirical Insights and a Baseline}},
author = {Zong, Yongshuo and Elezi, Ismail and Yang, Yongxin and Deng, Jiankang and Hospedales, Timothy},
booktitle = {ICML 2024 Workshops: LCFM},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/zong2024icmlw-longcontext/}
}