Learning Temporal Dynamics from Cycles in Narrated Video

Abstract

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle consistency objective, MMCC, jointly in vision and language. This objective requires a model to learn modality-agnostic functions to predict the future and past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resultant visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on target datasets or their labels.
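To make the cycle-consistency idea in the abstract concrete, below is a minimal PyTorch sketch of a temporal cycle objective: a forward predictor maps a starting embedding to a future one, a backward predictor maps that future back, and the loss penalizes failing to return to the start. All module names, shapes, and the soft attention over candidate moments are illustrative assumptions, not the authors' MMCC implementation, which additionally operates jointly over video and narration with modality-agnostic predictors.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CyclePredictor(nn.Module):
    # Hypothetical forward/backward predictors over precomputed embeddings.
    def __init__(self, dim=256):
        super().__init__()
        self.forward_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.backward_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def cycle_loss(self, z_start, z_candidates):
        # z_start: (B, D) embedding of the starting moment.
        # z_candidates: (B, N, D) embeddings of candidate future moments.
        z_future_pred = self.forward_net(z_start)                      # (B, D)
        # Soft-attend over candidates so the "future" is chosen from
        # real moments rather than hallucinated.
        attn = F.softmax(torch.einsum('bd,bnd->bn', z_future_pred, z_candidates), dim=-1)
        z_future = torch.einsum('bn,bnd->bd', attn, z_candidates)      # (B, D)
        # Predicting the past from the selected future should undo the
        # forward step and land back on the starting embedding.
        z_start_pred = self.backward_net(z_future)                     # (B, D)
        return F.mse_loss(z_start_pred, z_start)

# Example usage with random embeddings standing in for clip features:
#   model = CyclePredictor(dim=256)
#   loss = model.cycle_loss(torch.randn(8, 256), torch.randn(8, 16, 256))
#   loss.backward()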

Cite

Text

Epstein et al. "Learning Temporal Dynamics from Cycles in Narrated Video." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00151

Markdown

[Epstein et al. "Learning Temporal Dynamics from Cycles in Narrated Video." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/epstein2021iccv-learning/) doi:10.1109/ICCV48922.2021.00151

BibTeX

@inproceedings{epstein2021iccv-learning,
  title     = {{Learning Temporal Dynamics from Cycles in Narrated Video}},
  author    = {Epstein, Dave and Wu, Jiajun and Schmid, Cordelia and Sun, Chen},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {1480--1489},
  doi       = {10.1109/ICCV48922.2021.00151},
  url       = {https://mlanthology.org/iccv/2021/epstein2021iccv-learning/}
}