Object Recognition as Next Token Prediction

Abstract

We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, to simultaneously sample tokens of multiple labels in parallel and to rank generated labels by their probabilities during inference. To further enhance efficiency, we propose a simple strategy to construct a compact decoder by discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp.
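
To make the two masking features concrete, below is a minimal PyTorch sketch of a prefix-plus-independent-labels attention mask, together with a toy block-truncation helper. The function names, the boolean rows-attend-to-columns layout, and the keep-first/keep-last parameters are illustrative assumptions for this sketch, not the actual API of the nxtp repository.

import torch

def build_prefix_label_mask(num_image_tokens: int, label_lengths: list) -> torch.Tensor:
    # Boolean mask: rows index query positions, columns index key positions,
    # and True means "may attend".
    total = num_image_tokens + sum(label_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Image tokens act as a prefix with full bidirectional attention.
    mask[:num_image_tokens, :num_image_tokens] = True

    offset = num_image_tokens
    for length in label_lengths:
        rows = slice(offset, offset + length)
        # Every label token attends to the whole image prefix ...
        mask[rows, :num_image_tokens] = True
        # ... and causally to earlier tokens of its own label only,
        # so tokens from different labels remain independent.
        mask[rows, rows] = torch.tril(torch.ones(length, length, dtype=torch.bool))
        offset += length
    return mask

def drop_intermediate_blocks(blocks: list, keep_first: int, keep_last: int) -> list:
    # Illustrative version of the compact-decoder strategy: keep the first
    # and last transformer blocks of a pretrained decoder, discard the rest.
    return list(blocks[:keep_first]) + list(blocks[-keep_last:])

if __name__ == "__main__":
    # e.g. 4 image tokens followed by two labels of 2 and 3 text tokens;
    # printing the small mask makes the block structure visible.
    print(build_prefix_label_mask(4, [2, 3]).int())

Because each label's tokens depend only on the image prefix and on earlier tokens of the same label, a mask of this shape is what allows tokens of multiple labels to be sampled in one parallel pass, which is the intuition behind the paper's one-shot sampling.
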

Cite

Text

Yue et al. "Object Recognition as Next Token Prediction." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01575

Markdown

[Yue et al. "Object Recognition as Next Token Prediction." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/yue2024cvpr-object/) doi:10.1109/CVPR52733.2024.01575

BibTeX

@inproceedings{yue2024cvpr-object,
  title     = {{Object Recognition as Next Token Prediction}},
  author    = {Yue, Kaiyu and Chen, Bor-Chun and Geiping, Jonas and Li, Hengduo and Goldstein, Tom and Lim, Ser-Nam},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {16645--16656},
  doi       = {10.1109/CVPR52733.2024.01575},
  url       = {https://mlanthology.org/cvpr/2024/yue2024cvpr-object/}
}