Mining Better Samples for Contrastive Learning of Temporal Correspondence
Abstract
We present a novel framework for contrastive learning of pixel-level representations using only unlabeled video. Without requiring ground-truth annotations, our method collects well-defined positive correspondences by measuring their confidence, and well-defined negative ones by appropriately adjusting their hardness during training. This allows us to suppress the adverse impact of ambiguous matches and to prevent too-hard or too-easy negative samples from yielding a trivial solution. To accomplish this, we incorporate three criteria, ranging from pixel-level to video-level matching confidence, into a bottom-up pipeline, and design a curriculum that adapts the hardness of negative samples to the current representation power during training. With the proposed method, state-of-the-art performance is attained over the latest approaches on several video label propagation tasks.
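The two ideas in the abstract — down-weighting ambiguous positives by a confidence score, and scheduling which negatives enter the contrastive loss by a hardness curriculum — can be sketched as a small InfoNCE-style loss. This is a minimal illustration, not the paper's implementation: the confidence score, the hardness schedule `hardness_floor`, and all function names here are assumptions for exposition.

```python
import numpy as np

def contrastive_loss_with_curriculum(anchor, positive, negatives,
                                     pos_confidence, progress, tau=0.07):
    """Sketch of an InfoNCE-style loss where the positive term is weighted
    by a matching confidence and negatives are filtered by a hardness band
    that tightens as training `progress` (0 -> 1) grows. All schedules and
    thresholds are illustrative, not taken from the paper."""
    def cos(a, b):
        # cosine similarity between two feature vectors
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    sim_pos = cos(anchor, positive)
    sims_neg = np.array([cos(anchor, n) for n in negatives])

    # Curriculum (illustrative): early in training keep easy negatives
    # (low similarity); later, retain only harder ones so the loss stays
    # informative without collapsing to a trivial solution.
    hardness_floor = -1.0 + 1.5 * progress
    kept = sims_neg[sims_neg >= hardness_floor]
    if kept.size == 0:
        kept = sims_neg  # fall back to all negatives if the band is empty

    # Numerically stable log-softmax over [positive, kept negatives].
    logits = np.concatenate(([sim_pos], kept)) / tau
    m = logits.max()
    log_sum_exp = m + np.log(np.exp(logits - m).sum())
    log_prob_pos = logits[0] - log_sum_exp

    # Confidence weighting: ambiguous positive matches contribute less.
    return -pos_confidence * log_prob_pos
```

In this sketch, raising `progress` shrinks the pool of admissible negatives to the hard end of the similarity range, and a low `pos_confidence` attenuates the gradient from unreliable correspondences.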
Cite
Text
Jeon et al. "Mining Better Samples for Contrastive Learning of Temporal Correspondence." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00109
Markdown
[Jeon et al. "Mining Better Samples for Contrastive Learning of Temporal Correspondence." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/jeon2021cvpr-mining/) doi:10.1109/CVPR46437.2021.00109
BibTeX
@inproceedings{jeon2021cvpr-mining,
title = {{Mining Better Samples for Contrastive Learning of Temporal Correspondence}},
author = {Jeon, Sangryul and Min, Dongbo and Kim, Seungryong and Sohn, Kwanghoon},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {1034-1044},
doi = {10.1109/CVPR46437.2021.00109},
url = {https://mlanthology.org/cvpr/2021/jeon2021cvpr-mining/}
}