Self-Supervised Learning of Contextualized Local Visual Embeddings

Abstract

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE’s pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation. Code: https://github.com/sthalles/CLoVE.
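
The idea of combining local features by similarity can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that): the function name `contextualize`, the `temperature` parameter, the absence of learned projections, and the per-head L2 normalization are all assumptions made for illustration. Each spatial position of a CNN feature map is treated as a local embedding, and each head replaces it with a similarity-weighted average of all positions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contextualize(feature_map, num_heads=4, temperature=0.1):
    """Hypothetical similarity-based multi-head attention over local features.

    feature_map: array of shape (C, H, W), e.g. the output of a CNN encoder.
    Returns (contextualized, weights) with shapes (H*W, C) and
    (num_heads, H*W, H*W). No learned projections are used (an assumption).
    """
    c, h, w = feature_map.shape
    if c % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    n, d = h * w, c // num_heads

    # One local embedding per spatial position: (N, C).
    local = feature_map.reshape(c, n).T
    # Split channels into heads: (num_heads, N, d), then normalize per head.
    heads = l2_normalize(local.reshape(n, num_heads, d).transpose(1, 0, 2))

    # Cosine similarity between every pair of positions, per head.
    logits = heads @ heads.transpose(0, 2, 1) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions

    # Each position becomes a similarity-weighted mix of all positions.
    mixed = weights @ heads  # (num_heads, N, d)
    contextualized = mixed.transpose(1, 0, 2).reshape(n, c)
    return contextualized, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.normal(size=(8, 3, 3))  # toy feature map: C=8, H=W=3
    ctx, attn = contextualize(fmap, num_heads=4)
    print(ctx.shape, attn.shape)
```

A lower `temperature` concentrates each position's attention on its most similar neighbours, while a higher value pushes the result toward a plain average over the image.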

Cite

Text

Silva et al. "Self-Supervised Learning of Contextualized Local Visual Embeddings." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00025

Markdown

[Silva et al. "Self-Supervised Learning of Contextualized Local Visual Embeddings." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/silva2023iccvw-selfsupervised/) doi:10.1109/ICCVW60793.2023.00025

BibTeX

@inproceedings{silva2023iccvw-selfsupervised,
  title     = {{Self-Supervised Learning of Contextualized Local Visual Embeddings}},
  author    = {Silva, Thalles and Pedrini, Hélio and Rivera, Adín Ramírez},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {177-186},
  doi       = {10.1109/ICCVW60793.2023.00025},
  url       = {https://mlanthology.org/iccvw/2023/silva2023iccvw-selfsupervised/}
}