Mining Better Samples for Contrastive Learning of Temporal Correspondence

Abstract

We present a novel framework for contrastive learning of pixel-level representations using only unlabeled video. Without the need for ground-truth annotation, our method collects well-defined positive correspondences by measuring their confidences, and well-defined negative ones by appropriately adjusting their hardness during training. This allows us to suppress the adverse impact of ambiguous matches and to prevent overly hard or overly easy negative samples from yielding a trivial solution. To accomplish this, we incorporate three criteria, ranging from pixel-level to video-level matching confidence, into a bottom-up pipeline, and plan a curriculum that adapts the hardness of negative samples to the current representation power during training. With the proposed method, state-of-the-art performance is attained over the latest approaches on several video label propagation tasks.
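To make the two mining ideas in the abstract concrete, the sketch below shows a generic contrastive (InfoNCE-style) loss over pixel embeddings of two frames, keeping only confident positives and filtering negatives by a curriculum-controlled hardness quantile. This is an illustrative simplification, not the paper's actual pipeline: the confidence threshold, hardness quantile, temperature, and function name are all assumptions for exposition.

```python
import numpy as np

def mined_contrastive_loss(feat_a, feat_b, conf_thresh=0.5, hardness=0.7, tau=0.07):
    """Illustrative InfoNCE loss with confidence-based positive mining and
    hardness-curriculum negative mining (hypothetical simplification).

    feat_a, feat_b: (N, D) L2-normalized pixel embeddings from two frames.
    conf_thresh:    minimum matching confidence for a positive pair.
    hardness:       quantile of negatives to keep; raise it over training
                    so progressively harder negatives are included.
    tau:            softmax temperature.
    """
    sim = feat_a @ feat_b.T                       # (N, N) cosine similarities
    pos_idx = sim.argmax(axis=1)                  # best match per query pixel
    pos_sim = sim[np.arange(len(sim)), pos_idx]
    valid = pos_sim > conf_thresh                 # drop ambiguous positives

    losses = []
    for i in np.where(valid)[0]:
        neg_mask = np.ones(sim.shape[1], dtype=bool)
        neg_mask[pos_idx[i]] = False
        negs = sim[i, neg_mask]
        # curriculum: discard the hardest (1 - hardness) fraction of negatives
        cutoff = np.quantile(negs, hardness)
        negs = negs[negs <= cutoff]
        logits = np.concatenate([[pos_sim[i]], negs]) / tau
        logits -= logits.max()                    # numerical stability
        losses.append(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
    return float(np.mean(losses)) if losses else 0.0
```

In this toy form, growing `hardness` toward 1.0 over training mirrors the curriculum idea: early on, only easy negatives contribute; later, harder negatives sharpen the representation without destabilizing it.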

Cite

Text

Jeon et al. "Mining Better Samples for Contrastive Learning of Temporal Correspondence." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00109

Markdown

[Jeon et al. "Mining Better Samples for Contrastive Learning of Temporal Correspondence." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/jeon2021cvpr-mining/) doi:10.1109/CVPR46437.2021.00109

BibTeX

@inproceedings{jeon2021cvpr-mining,
  title     = {{Mining Better Samples for Contrastive Learning of Temporal Correspondence}},
  author    = {Jeon, Sangryul and Min, Dongbo and Kim, Seungryong and Sohn, Kwanghoon},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {1034-1044},
  doi       = {10.1109/CVPR46437.2021.00109},
  url       = {https://mlanthology.org/cvpr/2021/jeon2021cvpr-mining/}
}