Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Abstract

In this paper, we present the Draw-and-Understand framework, which explores how to integrate visual-prompt understanding into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multimodal instructions, enhancing a model's interactivity and fine-grained image comprehension. Within this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling them to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) in conjunction with language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset of 1.2 million triplets of image, visual prompt, and text, spanning natural images, document images, scene-text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual-prompting instructions. Experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training on MDVP-Instruct-Data together with image-level instruction datasets, our models exhibit impressive multimodal interaction and pixel-level understanding while maintaining their image-level visual perception performance.
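
As a concrete illustration of the dataset's structure, the Python sketch below shows what a single image-visual prompt-text triplet could look like. The field names, the normalized coordinate convention, and the <prompt N> reference syntax are assumptions made for exposition only; they are not the released MDVP-Instruct-Data schema.

# Hypothetical sketch of one MDVP-Instruct-Data triplet.
# All field names and the normalized [0, 1] coordinate convention are
# illustrative assumptions, not the dataset's actual format.
triplet = {
    "image": "natural/000123.jpg",  # one of the five image domains
    "visual_prompts": [
        {"type": "point", "xy": [0.42, 0.57]},
        {"type": "box", "xyxy": [0.10, 0.20, 0.55, 0.80]},
        {"type": "free_form", "polygon": [[0.10, 0.20], [0.30, 0.25], [0.28, 0.60]]},
    ],
    "conversation": [
        {"role": "user", "text": "Describe the object marked by <prompt 2>."},
        {"role": "assistant", "text": "A red bicycle leaning against a brick wall."},
    ],
}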

Cite

Text

Lin et al. "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want." International Conference on Learning Representations, 2025.

Markdown

[Lin et al. "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/lin2025iclr-drawandunderstand/)

BibTeX

@inproceedings{lin2025iclr-drawandunderstand,
  title     = {{Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}},
  author    = {Lin, Weifeng and Wei, Xinyu and An, Ruichuan and Gao, Peng and Zou, Bocheng and Luo, Yulin and Huang, Siyuan and Zhang, Shanghang and Li, Hongsheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/lin2025iclr-drawandunderstand/}
}