Joint-Task Self-Supervised Learning for Temporal Correspondence

Abstract

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between the two tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization reduces ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features that facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
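
As a rough illustration of what an inter-frame affinity matrix can look like, the NumPy sketch below builds a softmax-normalized affinity between per-pixel features of two frames and uses it to propagate labels from one frame to the next. This is not the paper's implementation: the function names, the dot-product similarity, and the `temperature` parameter are assumptions made for the example, and only the pixel-level transition is shown.

```python
import numpy as np


def affinity_matrix(feat_a, feat_b, temperature=1.0):
    """Hypothetical helper: affinity between two frames' pixel features.

    feat_a: (C, N_a) features of frame A, one column per pixel.
    feat_b: (C, N_b) features of frame B, one column per pixel.
    Returns an (N_a, N_b) matrix whose columns each sum to 1, so column j
    is a distribution over the pixels of A that pixel j of B may come from.
    """
    logits = feat_a.T @ feat_b / temperature        # (N_a, N_b) similarities
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def propagate(labels_a, affinity):
    """Hypothetical helper: carry per-pixel labels from frame A to frame B.

    labels_a: (K, N_a) label scores (e.g. one-hot segmentation masks).
    affinity: (N_a, N_b) column-normalized affinity matrix.
    Returns (K, N_b) label scores for frame B.
    """
    return labels_a @ affinity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_a = rng.normal(size=(16, 6))   # 6 pixels, 16-dim features (assumed sizes)
    feat_b = rng.normal(size=(16, 5))   # 5 pixels in the next frame
    A = affinity_matrix(feat_a, feat_b, temperature=0.5)
    labels_a = np.eye(2)[:, rng.integers(0, 2, size=6)]  # (2, 6) one-hot labels
    print(propagate(labels_a, A).shape)
```

Because each column of the affinity is a distribution over source pixels, propagated one-hot labels remain valid per-pixel distributions in the target frame.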

Cite

Text

Li et al. "Joint-Task Self-Supervised Learning for Temporal Correspondence." Neural Information Processing Systems, 2019.

Markdown

[Li et al. "Joint-Task Self-Supervised Learning for Temporal Correspondence." Neural Information Processing Systems, 2019.](https://mlanthology.org/neurips/2019/li2019neurips-jointtask/)

BibTeX

@inproceedings{li2019neurips-jointtask,
  title     = {{Joint-Task Self-Supervised Learning for Temporal Correspondence}},
  author    = {Li, Xueting and Liu, Sifei and De Mello, Shalini and Wang, Xiaolong and Kautz, Jan and Yang, Ming-Hsuan},
  booktitle = {Neural Information Processing Systems},
  year      = {2019},
  pages     = {318-328},
  url       = {https://mlanthology.org/neurips/2019/li2019neurips-jointtask/}
}