Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities
Abstract
Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
https://martius-lab.github.io/videosaur/
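The abstract does not spell out the loss in detail; the snippet below is a minimal, hedged sketch of what a temporal feature similarity target between patch features of consecutive frames could look like. It assumes frozen ViT patch features (e.g., from a self-supervised backbone such as DINO), a softmax-normalized cosine-similarity target, and a KL-style matching term; the function names, temperature, and exact formulation are illustrative assumptions, not the paper's definition.

```python
# Hedged sketch, NOT the authors' exact loss. Assumes per-frame ViT patch
# features of shape (num_patches, dim) from a frozen self-supervised encoder.
import torch
import torch.nn.functional as F

def similarity_target(feats_t, feats_tp1, temperature=0.1):
    """Patch-to-patch affinity between frame t and frame t+1.
    Assumption: softmax-normalized cosine similarity over next-frame patches."""
    a = F.normalize(feats_t, dim=-1)      # (N, D)
    b = F.normalize(feats_tp1, dim=-1)    # (N, D)
    sim = a @ b.t() / temperature         # (N, N) similarity logits
    return sim.softmax(dim=-1)            # each row: distribution over frame t+1 patches

def similarity_loss(pred_logits, feats_t, feats_tp1):
    """Match predicted affinities to the frozen-feature target (KL divergence)."""
    with torch.no_grad():
        target = similarity_target(feats_t, feats_tp1)
    return F.kl_div(pred_logits.log_softmax(dim=-1), target, reduction="batchmean")

# Usage with random stand-ins for patch features (196 patches, 384-dim).
feats_t, feats_tp1 = torch.randn(196, 384), torch.randn(196, 384)
pred_logits = torch.randn(196, 196, requires_grad=True)  # hypothetical decoder output
loss = similarity_loss(pred_logits, feats_t, feats_tp1)
loss.backward()
```

Because the target is built from temporal correlations between patches, patches that move together share similar affinity patterns, which is one way such a loss can act as a motion bias for grouping.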
Cite
Text
Zadaianchuk et al. "Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities." Neural Information Processing Systems, 2023.
Markdown
[Zadaianchuk et al. "Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zadaianchuk2023neurips-objectcentric/)
BibTeX
@inproceedings{zadaianchuk2023neurips-objectcentric,
title = {{Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities}},
author = {Zadaianchuk, Andrii and Seitzer, Maximilian and Martius, Georg},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/zadaianchuk2023neurips-objectcentric/}
}