Unsupervised Open-Vocabulary Object Localization in Videos

Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved with an unsupervised procedure that reads localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
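For readers unfamiliar with the slot-attention step mentioned above, the sketch below shows a minimal PyTorch version of the generic slot-attention module (Locatello et al., 2020): slots are sampled from a learned Gaussian, then refined by iterative attention in which slots compete for input features, followed by a GRU and MLP update. Module names, feature dimensions, and the number of iterations are illustrative assumptions; this is not the authors' implementation, which additionally operates on video features and assigns text to each slot by reading localized semantics from CLIP.

# Minimal slot-attention sketch (Locatello et al., 2020); dimensions and
# hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Learned Gaussian used to initialize the slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_input = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_pre_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (batch, num_features, dim) frame features
        b, n, d = inputs.shape
        inputs = self.norm_input(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        # Sample initial slots from the learned Gaussian.
        mu = self.slots_mu.expand(b, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots: slots compete for each input feature.
            attn = torch.softmax(torch.einsum('bid,bjd->bij', q, k) * self.scale, dim=1)
            attn = attn + self.eps
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over features
            updates = torch.einsum('bij,bjd->bid', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_pre_mlp(slots))
        return slots  # one vector per object-centric slot

In the paper's setting, the per-slot attention maps from such a module delimit candidate object regions; localized CLIP features pooled over those regions could then be compared against CLIP text embeddings to name each slot. That text-assignment step is specific to the paper and is not reproduced in the sketch above.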

Cite

Text

Fan et al. "Unsupervised Open-Vocabulary Object Localization in Videos." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01264

Markdown

[Fan et al. "Unsupervised Open-Vocabulary Object Localization in Videos." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/fan2023iccv-unsupervised/) doi:10.1109/ICCV51070.2023.01264

BibTeX

@inproceedings{fan2023iccv-unsupervised,
  title     = {{Unsupervised Open-Vocabulary Object Localization in Videos}},
  author    = {Fan, Ke and Bai, Zechen and Xiao, Tianjun and Zietlow, Dominik and Horn, Max and Zhao, Zixu and Simon-Gabriel, Carl-Johann and Shou, Mike Zheng and Locatello, Francesco and Schiele, Bernt and Brox, Thomas and Zhang, Zheng and Fu, Yanwei and He, Tong},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13747--13755},
  doi       = {10.1109/ICCV51070.2023.01264},
  url       = {https://mlanthology.org/iccv/2023/fan2023iccv-unsupervised/}
}