Sequential Modeling Enables Scalable Learning for Large Vision Models

Abstract

We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions, without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
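
The abstract describes a standard autoregressive recipe applied to vision: discretize visual data into token sequences, then train a transformer with cross-entropy next-token prediction. The PyTorch sketch below illustrates just that objective. The model dimensions, vocabulary size, and the random stand-in for tokenized visual sentences are illustrative assumptions for the sketch, not the paper's actual tokenizer or architecture.

import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # assumed codebook size of a discrete visual tokenizer
SEQ_LEN = 256       # assumed number of tokens per visual sentence

class TinyAutoregressiveLVM(nn.Module):
    """Toy decoder-only transformer over discrete visual tokens."""

    def __init__(self, vocab_size=VOCAB_SIZE, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        B, T = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (B, T, vocab_size) logits

model = TinyAutoregressiveLVM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for a batch of tokenized visual sentences; a real pipeline would
# produce these token ids with an image tokenizer.
batch = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN))

logits = model(batch[:, :-1])                 # predict token t from tokens < t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE),           # (B*(T-1), vocab)
    batch[:, 1:].reshape(-1),                 # targets shifted by one position
)
loss.backward()
opt.step()
print(f"next-token cross-entropy: {loss.item():.3f}")

The same next-token interface is what makes visual prompting at test time possible: condition the model on tokens from example input/output pairs plus a query, and decode the continuation as the answer.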

Cite

Text

Bai et al. "Sequential Modeling Enables Scalable Learning for Large Vision Models." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02157

Markdown

[Bai et al. "Sequential Modeling Enables Scalable Learning for Large Vision Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/bai2024cvpr-sequential/) doi:10.1109/CVPR52733.2024.02157

BibTeX

@inproceedings{bai2024cvpr-sequential,
  title     = {{Sequential Modeling Enables Scalable Learning for Large Vision Models}},
  author    = {Bai, Yutong and Geng, Xinyang and Mangalam, Karttikeya and Bar, Amir and Yuille, Alan L. and Darrell, Trevor and Malik, Jitendra and Efros, Alexei A.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {22861--22872},
  doi       = {10.1109/CVPR52733.2024.02157},
  url       = {https://mlanthology.org/cvpr/2024/bai2024cvpr-sequential/}
}