Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model

Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

CVPR 2024 pp. 14076-14088

doi:10.1109/CVPR52733.2024.01335

Abstract

The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across many downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/

Cite

Text

Pramanick et al. "Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01335

Markdown

[Pramanick et al. "Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/pramanick2024cvpr-jack/) doi:10.1109/CVPR52733.2024.01335

BibTeX

@inproceedings{pramanick2024cvpr-jack,
  title     = {{Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model}},
  author    = {Pramanick, Shraman and Han, Guangxing and Hou, Rui and Nag, Sayan and Lim, Ser-Nam and Ballas, Nicolas and Wang, Qifan and Chellappa, Rama and Almahairi, Amjad},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14076-14088},
  doi       = {10.1109/CVPR52733.2024.01335},
  url       = {https://mlanthology.org/cvpr/2024/pramanick2024cvpr-jack/}
}