InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4k HD

Dong, Xiaoyi; Zhang, Pan; Zang, Yuhang; Cao, Yuhang; Wang, Bin; Ouyang, Linke; Zhang, Songyang; Duan, Haodong; Zhang, Wenwei; Li, Yining; Yan, Hang; Gao, Yang; Chen, Zhe; Zhang, Xinyue; Li, Wei; Li, Jingwen; Wang, Wenhai; Chen, Kai; He, Conghui; Zhang, Xingcheng; Dai, Jifeng; Qiao, Yu; Lin, Dahua; Wang, Jiaqi

doi:10.52202/079017-1348

NeurIPS 2024

Abstract

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 $\times$ 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 $\times$ 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of resolutions from 336 pixels to the 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the aspect ratios of the training images while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 $\times$ 336), leading to dynamic training resolution from 336 pixels to the 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.
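
The automatic patch configuration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `patch_layout`, the `max_patches` budget, and the rule of choosing the widest grid that fits the budget are assumptions made for illustration. Only the 336-pixel ViT input size and the goal of preserving the image aspect ratio come from the abstract.

```python
PATCH = 336  # ViT input side length, as stated in the abstract


def patch_layout(width, height, max_patches):
    """Pick a (cols, rows) grid of 336-px patches for an image.

    Hypothetical rule (an assumption, not taken from the abstract):
    use the widest grid whose row count follows the image aspect
    ratio and whose total patch count stays within max_patches.
    """
    if width <= 0 or height <= 0 or max_patches < 1:
        raise ValueError("width, height and max_patches must be positive")

    def rows_for(cols):
        # Integer ceiling of cols * height / width, avoiding float rounding.
        return -(-cols * height // width)

    # Fallback for very tall images: one column, rows clamped to the budget.
    best = (1, min(rows_for(1), max_patches))
    for cols in range(2, max_patches + 1):
        rows = rows_for(cols)
        if cols * rows > max_patches:
            break  # cols * rows only grows with cols, so stop here
        best = (cols, rows)
    return best


def target_size(width, height, max_patches):
    """Pixel size the image would be resized or padded to under this sketch."""
    cols, rows = patch_layout(width, height, max_patches)
    return cols * PATCH, rows * PATCH


if __name__ == "__main__":
    # A 3840 x 1600 input with an illustrative budget of 55 patches.
    print(patch_layout(3840, 1600, 55))
    print(target_size(3840, 1600, 55))
```

Under these assumptions, a 336 $\times$ 336 image with a budget of one patch stays a single patch, while a 3840 $\times$ 1600 image with a larger budget is split into a wide grid whose shape tracks the original aspect ratio, which is the sense in which the patch count and layout vary per image.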

Cite

Text

Dong et al. "InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4k HD." Neural Information Processing Systems, 2024. doi:10.52202/079017-1348

Markdown

[Dong et al. "InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4k HD." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/dong2024neurips-internlmxcomposer24khd/) doi:10.52202/079017-1348

BibTeX

@inproceedings{dong2024neurips-internlmxcomposer24khd,
  title     = {{InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4k HD}},
  author    = {Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Bin and Ouyang, Linke and Zhang, Songyang and Duan, Haodong and Zhang, Wenwei and Li, Yining and Yan, Hang and Gao, Yang and Chen, Zhe and Zhang, Xinyue and Li, Wei and Li, Jingwen and Wang, Wenhai and Chen, Kai and He, Conghui and Zhang, Xingcheng and Dai, Jifeng and Qiao, Yu and Lin, Dahua and Wang, Jiaqi},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1348},
  url       = {https://mlanthology.org/neurips/2024/dong2024neurips-internlmxcomposer24khd/}
}