GiT: Towards Generalist Vision Transformer Through Universal Language Interface

Abstract

This paper proposes a simple yet effective framework, called GiT, simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT builds a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results on various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models are available at https://github.com/Haiyang-W/GiT.
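To make the universal language interface concrete, here is a minimal sketch (not the authors' code; see the repository above for the real implementation) of the core idea: task outputs such as bounding boxes are serialized into tokens from one shared vocabulary, so the same auto-regressive decoder used for captioning can emit them. The vocabulary layout, bin count, and token ids below are illustrative assumptions.

```python
# Minimal sketch of a "universal language interface": detection outputs are
# quantized into coordinate tokens appended after the text vocabulary, so a
# single transformer can decode captions and boxes with one next-token head.
from typing import List, Tuple

NUM_TEXT_TOKENS = 32000   # assumed size of the text sub-vocabulary
NUM_COORD_BINS = 1000     # assumed number of coordinate quantization bins


def coord_to_token(x: float) -> int:
    """Quantize a normalized coordinate in [0, 1] to a coordinate token id."""
    bin_idx = min(int(x * NUM_COORD_BINS), NUM_COORD_BINS - 1)
    return NUM_TEXT_TOKENS + bin_idx  # coordinate tokens follow text tokens


def box_to_tokens(box: Tuple[float, float, float, float],
                  cls_tokens: List[int]) -> List[int]:
    """Serialize one detection (class tokens + box corners) as a token run."""
    return cls_tokens + [coord_to_token(c) for c in box]


# Example: a "cat" detection becomes a short token sequence that an
# auto-regressive decoder can produce token by token, just like text.
cat_tokens = [1234]  # hypothetical token id(s) for the word "cat"
print(box_to_tokens((0.12, 0.30, 0.58, 0.91), cat_tokens))
# -> [1234, 32120, 32300, 32580, 32910]
```

Under this scheme, dense tasks like segmentation can likewise be expressed as sequences of per-position label tokens, which is what lets one vanilla ViT cover image-level, sparse, and dense prediction without task-specific heads.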

Cite

Text

Wang et al. "GiT: Towards Generalist Vision Transformer Through Universal Language Interface." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73397-0_4

Markdown

[Wang et al. "GiT: Towards Generalist Vision Transformer Through Universal Language Interface." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-git/) doi:10.1007/978-3-031-73397-0_4

BibTeX

@inproceedings{wang2024eccv-git,
  title     = {{GiT: Towards Generalist Vision Transformer Through Universal Language Interface}},
  author    = {Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73397-0_4},
  url       = {https://mlanthology.org/eccv/2024/wang2024eccv-git/}
}