Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning

Abstract

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models' ability to learn comprehensive and compositional representations from visual data. This paper explores the notion of providing semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. Leveraging off-the-shelf segmentation and scene-graph models, we extract representations of instance segmentation masks (referred to as tangible tokens) and of relationships and actions (referred to as intangible tokens). Subsequently, we pre-train a vision-side transformer by incorporating these newly extracted tokens and aligning the resultant embeddings with caption embeddings from a text-side encoder. To capture the structural and semantic relationships among visual tokens, we introduce additive attention weights, which are used to compute self-attention scores. Our experiments on COCO demonstrate notable improvements over ViTs in learned representation quality across text-to-image (+47%) and image-to-text retrieval (+44%) tasks. Furthermore, we showcase the advantages on compositionality benchmarks such as ARO (+18%) and Winoground (+10%).
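The abstract mentions additive attention weights used when computing self-attention scores. A minimal sketch of that general idea follows, assuming the bias is added to the scaled dot-product scores before the softmax; the paper's exact formulation (and how the bias is derived from the scene graph) may differ, and all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(X, Wq, Wk, Wv, B):
    """Single-head self-attention with an additive score bias B.

    B[i, j] is a hypothetical scalar encoding a structural/semantic
    relationship between tokens i and j (e.g. from a scene graph);
    it is added to the scaled dot-product scores before the softmax.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B  # additive attention weights
    return softmax(scores, axis=-1) @ V

# Toy example: 4 visual tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
B = np.zeros((4, 4))
B[0, 1] = B[1, 0] = 2.0  # pretend tokens 0 and 1 are related
out = biased_self_attention(X, Wq, Wk, Wv, B)
```

With `B = 0` this reduces to standard scaled dot-product attention; a nonzero `B` shifts attention mass toward related token pairs.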

Cite

Text

Kalibhat et al. "Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Kalibhat et al. "Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/kalibhat2025cvprw-understanding/)

BibTeX

@inproceedings{kalibhat2025cvprw-understanding,
  title     = {{Understanding the Effect of Using Semantically Meaningful Tokens for Visual Representation Learning}},
  author    = {Kalibhat, Neha Mukund and Kattakinda, Priyatham and Nawathe, Sumit and Zarei, Arman and Seleznev, Nikita and Sharpe, Samuel and Kumar, Senthil and Feizi, Soheil},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {3663--3672},
  url       = {https://mlanthology.org/cvprw/2025/kalibhat2025cvprw-understanding/}
}