Osprey: Pixel Understanding with Visual Instruction Tuning
Abstract
Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at the pixel level. Besides, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset, and demo can be found at https://github.com/CircleRadon/Osprey.
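The abstract does not spell out how the mask-aware visual extractor turns a mask into a feature. One common way to get a single region feature from a feature map and a binary mask is masked average pooling, sketched below in plain Python. This is an illustrative assumption, not Osprey's actual implementation; the function name `masked_average_pool` and the nested-list layout (height × width × channels) are hypothetical.

```python
from typing import List, Sequence


def masked_average_pool(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average the per-pixel feature vectors at positions where the mask is non-zero.

    features: H x W x C nested sequence (hypothetical layout for illustration).
    mask:     H x W nested sequence of 0/1 values marking the region.
    Returns a C-dimensional region feature; all zeros if the mask is empty.
    """
    if not features or not features[0]:
        raise ValueError("feature map must be non-empty")
    same_height = len(features) == len(mask)
    if not same_height or any(len(fr) != len(mr) for fr, mr in zip(features, mask)):
        raise ValueError("features and mask must share the same H x W shape")

    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]

    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels; the mask selects the diagonal pixels.
toy_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
toy_mask = [
    [1, 0],
    [0, 1],
]
region_feature = masked_average_pool(toy_features, toy_mask)
print(region_feature)  # [4.0, 5.0]
```

In a real model the feature map would come from the vision encoder and the pooled vector would be projected into the language model's embedding space; those steps are omitted here.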
Cite
Text
Yuan et al. "Osprey: Pixel Understanding with Visual Instruction Tuning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02664
Markdown
[Yuan et al. "Osprey: Pixel Understanding with Visual Instruction Tuning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yuan2024cvpr-osprey/) doi:10.1109/CVPR52733.2024.02664
BibTeX
@inproceedings{yuan2024cvpr-osprey,
title = {{Osprey: Pixel Understanding with Visual Instruction Tuning}},
author = {Yuan, Yuqian and Li, Wentong and Liu, Jian and Tang, Dongqi and Luo, Xinjie and Qin, Chi and Zhang, Lei and Zhu, Jianke},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {28202-28211},
doi = {10.1109/CVPR52733.2024.02664},
url = {https://mlanthology.org/cvpr/2024/yuan2024cvpr-osprey/}
}