Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Abstract

In this paper, we present the Draw-and-Understand framework, which explores how to integrate visual-prompt understanding into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multimodal instructions, enhancing a model's interactivity and fine-grained image comprehension. Within this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling them to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) in conjunction with language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset of 1.2 million triplets of image, visual prompt, and text, spanning natural images, document images, scene-text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual-prompting instructions. Experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training on MDVP-Instruct-Data together with image-level instruction datasets, our models exhibit impressive multimodal interaction and pixel-level understanding while maintaining their image-level visual perception performance.
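
As a concrete illustration of the dataset's structure, the Python sketch below shows what a single image-visual prompt-text triplet could look like. The field names, the normalized coordinate convention, and the <prompt N> reference syntax are assumptions made for exposition only; they are not the released MDVP-Instruct-Data schema.

# Hypothetical sketch of one MDVP-Instruct-Data triplet.
# All field names and the normalized [0, 1] coordinate convention are
# illustrative assumptions, not the dataset's actual format.
triplet = {
    "image": "natural/000123.jpg",  # one of the five image domains
    "visual_prompts": [
        {"type": "point", "xy": [0.42, 0.57]},
        {"type": "box", "xyxy": [0.10, 0.20, 0.55, 0.80]},
        {"type": "free_form", "polygon": [[0.10, 0.20], [0.30, 0.25], [0.28, 0.60]]},
    ],
    "conversation": [
        {"role": "user", "text": "Describe the object marked by <prompt 2>."},
        {"role": "assistant", "text": "A red bicycle leaning against a brick wall."},
    ],
}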

Cite

Text

Lin et al. "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want." International Conference on Learning Representations, 2025.

Markdown

[Lin et al. "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lin2025iclr-drawandunderstand/)

BibTeX

@inproceedings{lin2025iclr-drawandunderstand,
  title     = {{Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}},
  author    = {Lin, Weifeng and Wei, Xinyu and An, Ruichuan and Gao, Peng and Zou, Bocheng and Luo, Yulin and Huang, Siyuan and Zhang, Shanghang and Li, Hongsheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lin2025iclr-drawandunderstand/}
}