Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning

Abstract

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models' ability to learn comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and of relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
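The abstract mentions additive attention weights used when computing self-attention scores. A minimal sketch of that general idea follows, assuming the bias is added to the scaled dot-product scores before the softmax; the paper's exact formulation (and how the bias is derived from the scene graph) may differ, and all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(X, Wq, Wk, Wv, B):
    """Single-head self-attention with an additive score bias B.

    B[i, j] is a hypothetical scalar encoding a structural/semantic
    relationship between tokens i and j (e.g. from a scene graph);
    it is added to the scaled dot-product scores before the softmax.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B  # additive attention weights
    return softmax(scores, axis=-1) @ V

# Toy example: 4 visual tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
B = np.zeros((4, 4))
B[0, 1] = B[1, 0] = 2.0  # pretend tokens 0 and 1 are related
out = biased_self_attention(X, Wq, Wk, Wv, B)
```

With `B = 0` this reduces to standard scaled dot-product attention; a nonzero `B` shifts attention mass toward related token pairs.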

Cite

Text

Kalibhat et al. "Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Kalibhat et al. "Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/kalibhat2025cvprw-understanding/)

BibTeX

@inproceedings{kalibhat2025cvprw-understanding,
  title     = {{Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning}},
  author    = {Kalibhat, Neha Mukund and Kattakinda, Priyatham and Nawathe, Sumit and Zarei, Arman and Seleznev, Nikita and Sharpe, Samuel and Kumar, Senthil and Feizi, Soheil},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {3663--3672},
  url       = {https://mlanthology.org/cvprw/2025/kalibhat2025cvprw-understanding/}
}