Self-Supervised Visual Representation Learning from Hierarchical Grouping

Abstract

We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.
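To make the contrastive objective described above concrete, below is a minimal PyTorch sketch of one way a hierarchy-guided per-pixel contrastive loss could look. This is not the paper's exact formulation: the function name hierarchical_contrastive_loss, the InfoNCE-style form, and the representation of the region tree as a list of per-level label maps are all illustrative assumptions.

import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(embeddings, region_labels, temperature=0.1, n_anchors=256):
    # embeddings:    (C, H, W) per-pixel embedding map from the network.
    # region_labels: list of (H, W) integer label maps, coarse to fine; pixels
    #                sharing an id at a level lie in the same region at that level.
    C, H, W = embeddings.shape
    flat = F.normalize(embeddings.reshape(C, -1), dim=0)  # unit-norm pixel embeddings
    idx = torch.randint(0, H * W, (n_anchors,))           # random anchor pixels
    loss = embeddings.new_zeros(())
    for labels in region_labels:
        lab = labels.reshape(-1)
        sim = (flat[:, idx].t() @ flat) / temperature     # (n_anchors, H*W) similarities
        same = lab[idx].unsqueeze(1) == lab.unsqueeze(0)  # same-region mask per anchor
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # average log-probability assigned to same-region pixels (the positives)
        pos = (log_prob * same).sum(1) / same.sum(1).clamp(min=1)
        loss = loss - pos.mean()
    return loss / len(region_labels)

# Toy usage with random embeddings and a two-level hierarchy (hypothetical data):
emb = torch.randn(16, 64, 64, requires_grad=True)
fine = torch.randint(0, 50, (64, 64))  # fine-level region ids
coarse = fine // 10                    # toy merge of fine regions into coarser ones
hierarchical_contrastive_loss(emb, [coarse, fine]).backward()

Averaging the per-level losses gives one plausible reading of "distances respect the region hierarchy": a pixel pair inside the same fine region counts as a positive at every level, while a pair merged only at a coarse level counts as a positive at fewer levels, so it is pushed together less strongly.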

Cite

Text

Zhang and Maire. "Self-Supervised Visual Representation Learning from Hierarchical Grouping." Neural Information Processing Systems, 2020.

Markdown

[Zhang and Maire. "Self-Supervised Visual Representation Learning from Hierarchical Grouping." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/zhang2020neurips-selfsupervised/)

BibTeX

@inproceedings{zhang2020neurips-selfsupervised,
  title     = {{Self-Supervised Visual Representation Learning from Hierarchical Grouping}},
  author    = {Zhang, Xiao and Maire, Michael},
  booktitle = {Neural Information Processing Systems},
  year      = {2020},
  url       = {https://mlanthology.org/neurips/2020/zhang2020neurips-selfsupervised/}
}