Zoomer: Adaptive Image Focus Optimization for Black-Box MLLM
Abstract
Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to optimally allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.
Cite
Text
Qian et al. "Zoomer: Adaptive Image Focus Optimization for Black-Box MLLM." Transactions on Machine Learning Research, 2025.
Markdown
[Qian et al. "Zoomer: Adaptive Image Focus Optimization for Black-Box MLLM." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/qian2025tmlr-zoomer/)
BibTeX
@article{qian2025tmlr-zoomer,
  title = {{Zoomer: Adaptive Image Focus Optimization for Black-Box MLLM}},
  author = {Qian, Jiaxu and Wang, Chendong and Yang, Yifan and Zhang, Chaoyun and Jiang, Huiqiang and Luo, Xufang and Kang, Yu and Lin, Qingwei and Zhang, Anlan and Jiang, Shiqi and Cao, Ting and Mao, Tianjun and Banerjee, Suman and Liu, Guyue and Rajmohan, Saravan and Zhang, Dongmei and Yang, Yuqing and Zhang, Qi and Qiu, Lili},
  journal = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://mlanthology.org/tmlr/2025/qian2025tmlr-zoomer/}
}