QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining

Chickering, Kyle R.; Li, Bangzheng; Chen, Muhao

ICLR 2026

Abstract

Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate cross-modal representation learning. CLIP is a widely adopted foundational vision-language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder has notable limitations: it handles only fixed input resolutions, and it fails to produce well-separated embeddings for dissimilar images. Replacing the vision encoder of an existing model is typically expensive because it often requires retraining the entire model pipeline. In this work, we identify two factors that underlie these limitations: mesoscopic bias and interpolation bias. To address them, we propose QLIP, a drop-in replacement for CLIP that integrates with existing MLLMs in only a few lines of code and enhances both coarse-grained and fine-grained visual understanding without retraining. QLIP is built around an image quadtree, which replaces the standard uniform grid of patches with a novel content-aware patchification. Our experiments show that QLIP improves the general visual question answering accuracy of the LLaVA-1.5 model series across model sizes, without retraining or fine-tuning the full MLLM. Notably, QLIP improves detailed understanding on the challenging $V^*$ benchmark by up to 13.6%.
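
To make the idea of content-aware patchification concrete, here is a minimal sketch of a generic image quadtree. It is not the QLIP implementation: the variance-based split rule, the function name `quadtree_patches`, and the parameters `min_size` and `threshold` are all assumptions chosen for illustration. The abstract does not specify the paper's splitting criterion or how leaves become tokens.

```python
from statistics import pvariance


def quadtree_patches(image, min_size=2, threshold=0.01):
    """Split a square grayscale image into variable-size quadtree leaves.

    `image` is a list of rows whose side length is a power of two.
    Returns a list of (row, col, size) leaves. The variance-based split
    rule is an illustrative assumption, not the criterion used by QLIP.
    """
    leaves = []

    def visit(row, col, size):
        pixels = [
            image[r][c]
            for r in range(row, row + size)
            for c in range(col, col + size)
        ]
        # Stop at the minimum patch size or when the region is nearly uniform.
        if size <= min_size or pvariance(pixels) <= threshold:
            leaves.append((row, col, size))
            return
        half = size // 2
        # Otherwise recurse into the four quadrants.
        for dr in (0, half):
            for dc in (0, half):
                visit(row + dr, col + dc, half)

    visit(0, 0, len(image))
    return leaves


# 8x8 image: flat everywhere except a checkerboard in the top-left quadrant.
image = [[0.0] * 8 for _ in range(8)]
for r in range(4):
    for c in range(4):
        image[r][c] = float((r + c) % 2)

leaves = quadtree_patches(image)
for leaf in leaves:
    print(leaf)
```

In this toy input, the detailed quadrant is covered by small leaves while each flat quadrant stays a single large leaf, so uniform regions cost fewer patches than a uniform grid would spend on them.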

Cite

Text

Chickering et al. "QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining." International Conference on Learning Representations, 2026.

Markdown

[Chickering et al. "QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chickering2026iclr-qlip/)

BibTeX

@inproceedings{chickering2026iclr-qlip,
  title     = {{QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining}},
  author    = {Chickering, Kyle R. and Li, Bangzheng and Chen, Muhao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chickering2026iclr-qlip/}
}