LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images
Abstract
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 8 benchmarks. Notably, our model built on LLaVA-1.5 336×336 supports 6 times larger (i.e., 672×1008) resolution images, and achieves a 5.7 accuracy improvement on TextVQA.
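To make the modularization idea concrete, below is a minimal sketch (my own simplification, not the paper's exact algorithm) of how an arbitrary-aspect-ratio image can be partitioned: pick a slice grid whose shape best matches the image's aspect ratio under a slice budget, then derive the pixel boxes of the resulting variable-sized slices. Names such as `choose_grid` and `slice_boxes`, and the constants `PATCH` and `MAX_SLICES`, are illustrative assumptions rather than identifiers from the paper.

```python
from typing import List, Tuple

PATCH = 336          # assumed ViT input size (LLaVA-1.5 uses a 336x336 encoder)
MAX_SLICES = 6       # assumed upper bound on slices per image

def choose_grid(width: int, height: int, max_slices: int = MAX_SLICES) -> Tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio is closest to the image's."""
    img_ratio = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_slices + 1):
        for rows in range(1, max_slices // cols + 1):
            err = abs((cols / rows) - img_ratio)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def slice_boxes(width: int, height: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes of the variable-sized slices."""
    cols, rows = choose_grid(width, height)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = width * c // cols, width * (c + 1) // cols
            top, bottom = height * r // rows, height * (r + 1) // rows
            boxes.append((left, top, right, bottom))
    return boxes

if __name__ == "__main__":
    # A 672x1008 portrait image (the resolution quoted in the abstract) is covered
    # by a 2x3 grid, i.e. six 336x336 slices.
    print(choose_grid(672, 1008))   # -> (2, 3)
    print(slice_boxes(672, 1008))
```

This sketch covers only the geometric partitioning; in LLaVA-UHD the resulting slices are encoded by the shared visual encoder, their tokens are condensed by the compression module, and the spatial schema then tells the LLM how the slice tokens are laid out.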
Cite
Text
Guo et al. "LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73010-8_23
Markdown
[Guo et al. "LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/guo2024eccv-llavauhd/) doi:10.1007/978-3-031-73010-8_23
BibTeX
@inproceedings{guo2024eccv-llavauhd,
title = {{LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images}},
author = {Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73010-8_23},
url = {https://mlanthology.org/eccv/2024/guo2024eccv-llavauhd/}
}