Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Wang, Haochen; Wang, Yuhao; Zhang, Tao; Zhou, Yikang; Li, Yanwei; Wang, Jiacong; Zheng, Jiani; Tian, Ye; Meng, Jiahao; Huang, Zilong; Mai, Guangcan; Wang, Anran; Tong, Yunhai; Wang, Zhuochen; Li, Xiangtai; Zhang, Zhaoxiang

ICLR 2026

Abstract

While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle with the dense world, i.e., complex scenes requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, these capabilities naturally enable (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GARBench, which not only provides a more accurate evaluation of single-region comprehension but also, more importantly, measures interactions and complex reasoning across multiple regions. Empirically, GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by 4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GARBench-VQA. More importantly, our zero-shot GAR-8B even outperforms the in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong comprehension capabilities can be easily transferred to videos. Code and data will be released to the community.

Cite

Text

Wang et al. "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-grasp/)

BibTeX

@inproceedings{wang2026iclr-grasp,
  title     = {{Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs}},
  author    = {Wang, Haochen and Wang, Yuhao and Zhang, Tao and Zhou, Yikang and Li, Yanwei and Wang, Jiacong and Zheng, Jiani and Tian, Ye and Meng, Jiahao and Huang, Zilong and Mai, Guangcan and Wang, Anran and Tong, Yunhai and Wang, Zhuochen and Li, Xiangtai and Zhang, Zhaoxiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-grasp/}
}