VisionLLaMA: A Unified Llama Backbone for Vision Tasks

Abstract

Large language models are built on top of a transformer-based architecture to process textual inputs. For example, the LLaMA family of models stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modeling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms in a good portion of downstream tasks of image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over previous state-of-the-art vision transformers. It is our hope that researchers in computer vision can apply VisionLLaMA, as presented here, to solve various specific image generation and perception tasks. Code is available at: https://github.com/Meituan-AutoML/VisionLLaMA
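For orientation only, the sketch below shows what a LLaMA-style block looks like when applied to flattened image patches: RMSNorm, rotary position embeddings extended to a 2D patch grid, and a SwiGLU feed-forward layer. This is a minimal illustrative sketch in plain PyTorch, not the authors' code; all class and argument names are our own assumptions, and the simple 2D rotary embedding here merely stands in for the positional scheme described in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm, as used in LLaMA-style blocks."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def rope_2d(x, grid_h, grid_w, base=10000.0):
    """Illustrative rotary position embedding over a 2D patch grid.

    x: (batch, heads, grid_h * grid_w, head_dim). Half of head_dim rotates
    with the row index, the other half with the column index.
    """
    b, h, n, d = x.shape
    assert d % 4 == 0 and n == grid_h * grid_w
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, 2, device=x.device).float() / half))
    rows = torch.arange(grid_h, device=x.device).float()
    cols = torch.arange(grid_w, device=x.device).float()
    ang_r = torch.outer(rows, freqs)[:, None, :].expand(grid_h, grid_w, -1).reshape(n, -1)
    ang_c = torch.outer(cols, freqs)[None, :, :].expand(grid_h, grid_w, -1).reshape(n, -1)
    ang = torch.cat([ang_r, ang_c], dim=-1)          # (n, d // 2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # rotate interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class LlamaStyleVisionBlock(nn.Module):
    """One pre-norm block: RMSNorm -> attention with 2D RoPE -> RMSNorm -> SwiGLU."""
    def __init__(self, dim=384, heads=6, mlp_ratio=4.0):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        hidden = int(dim * mlp_ratio)
        self.w_gate = nn.Linear(dim, hidden, bias=False)   # SwiGLU gate branch
        self.w_up = nn.Linear(dim, hidden, bias=False)     # SwiGLU value branch
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x, grid_h, grid_w):
        b, n, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = rope_2d(q, grid_h, grid_w), rope_2d(k, grid_h, grid_w)
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        h2 = self.norm2(x)
        x = x + self.w_down(F.silu(self.w_gate(h2)) * self.w_up(h2))
        return x


# Tiny usage example: a 224x224 image split into 16x16 patches gives a 14x14 grid.
block = LlamaStyleVisionBlock(dim=384, heads=6)
patches = torch.randn(2, 14 * 14, 384)
out = block(patches, grid_h=14, grid_w=14)   # (2, 196, 384)

The design choice the paper builds on is simply to reuse the LLaMA block unchanged wherever possible (pre-norm with RMSNorm, no attention/MLP biases, SwiGLU), so that the only vision-specific part is how positions on the patch grid enter the rotary embedding.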

Cite

Text

Chu et al. "VisionLLaMA: A Unified Llama Backbone for Vision Tasks." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72848-8_1

Markdown

[Chu et al. "VisionLLaMA: A Unified Llama Backbone for Vision Tasks." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chu2024eccv-visionllama/) doi:10.1007/978-3-031-72848-8_1

BibTeX

@inproceedings{chu2024eccv-visionllama,
  title     = {{VisionLLaMA: A Unified Llama Backbone for Vision Tasks}},
  author    = {Chu, Xiangxiang and Su, Jianlin and Zhang, Bo and Shen, Chunhua},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72848-8_1},
  url       = {https://mlanthology.org/eccv/2024/chu2024eccv-visionllama/}
}