Learning Visual Representations via Language-Guided Sampling
Abstract
Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
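Below is a minimal sketch of the core idea: instead of generating two augmented views of one image, each image is paired with the image whose caption is most similar under a frozen pre-trained language model, and the pair is fed to a standard contrastive objective. This is an illustrative reconstruction, not the paper's code; the NT-Xent loss, function names, and tensor dimensions are assumptions, and random tensors stand in for real caption embeddings and visual features.

```python
import torch
import torch.nn.functional as F

def sample_language_nn_pairs(text_emb):
    """For each caption embedding, find the index of its nearest
    neighbor (excluding itself) by cosine similarity."""
    z = F.normalize(text_emb, dim=1)
    sim = z @ z.T
    sim.fill_diagonal_(-float("inf"))  # exclude trivial self-matches
    return sim.argmax(dim=1)

def nt_xent_loss(z1, z2, temperature=0.1):
    """Standard symmetric InfoNCE / NT-Xent loss over paired embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Stand-ins: in practice `text_emb` would come from a frozen pre-trained
# sentence encoder applied to image captions, and `img_feats` from the
# visual encoder being trained (dimensions here are hypothetical).
text_emb = torch.randn(256, 384)   # caption embeddings
img_feats = torch.randn(256, 128)  # visual features for the same images

nn_idx = sample_language_nn_pairs(text_emb)        # language-guided positives
loss = nt_xent_loss(img_feats, img_feats[nn_idx])  # contrastive objective
```

The key design point this sketch captures is that positives are sampled in language space but the loss is computed purely between image features, so no cross-modal loss is minimized.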
Cite
Text
El Banani et al. "Learning Visual Representations via Language-Guided Sampling." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01841

Markdown
[El Banani et al. "Learning Visual Representations via Language-Guided Sampling." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/banani2023cvpr-learning/) doi:10.1109/CVPR52729.2023.01841

BibTeX
@inproceedings{banani2023cvpr-learning,
title = {{Learning Visual Representations via Language-Guided Sampling}},
author = {El Banani, Mohamed and Desai, Karan and Johnson, Justin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {19208--19220},
doi = {10.1109/CVPR52729.2023.01841},
url = {https://mlanthology.org/cvpr/2023/banani2023cvpr-learning/}
}