Osprey: Pixel Understanding with Visual Instruction Tuning

Abstract

Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset, and demo can be found at https://github.com/CircleRadon/Osprey.
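
The abstract does not spell out how the mask-aware visual extractor works. As a rough illustration of the general idea of pooling image features under a region mask, the sketch below averages the feature vectors at the masked positions of a single feature map. It is a generic masked average pooling written in plain Python, not Osprey's actual extractor; the function name `mask_pool`, the nested-list layout (height x width x channels), and the zero-vector fallback for an empty mask are all assumptions made for this example.

```python
from typing import List, Sequence


def mask_pool(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average the feature vectors at positions where the mask is nonzero.

    features: H x W x C nested sequences (hypothetical layout).
    mask:     H x W binary mask aligned with the feature map.
    Returns a C-dimensional vector; all zeros if the mask is empty.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        # Assumed fallback: an empty region yields a zero vector.
        return total
    return [t / count for t in total]


# Usage: a 2 x 2 feature map with 2 channels and a diagonal mask.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
mask = [
    [1, 0],
    [0, 1],
]
region_vector = mask_pool(features, mask)
```

In a real model the pooled vector would typically be projected into the language model's embedding space; that step is omitted here because the abstract gives no details about it.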

Cite

Text

Yuan et al. "Osprey: Pixel Understanding with Visual Instruction Tuning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02664

Markdown

[Yuan et al. "Osprey: Pixel Understanding with Visual Instruction Tuning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yuan2024cvpr-osprey/) doi:10.1109/CVPR52733.2024.02664

BibTeX

@inproceedings{yuan2024cvpr-osprey,
  title     = {{Osprey: Pixel Understanding with Visual Instruction Tuning}},
  author    = {Yuan, Yuqian and Li, Wentong and Liu, Jian and Tang, Dongqi and Luo, Xinjie and Qin, Chi and Zhang, Lei and Zhu, Jianke},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {28202-28211},
  doi       = {10.1109/CVPR52733.2024.02664},
  url       = {https://mlanthology.org/cvpr/2024/yuan2024cvpr-osprey/}
}