Video OWL-ViT: Temporally-Consistent Open-World Localization in Video

Abstract

We present an architecture and a training recipe that adapt pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pretraining on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pretrained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pretraining, can be transferred successfully to open-world localization across diverse videos.
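The recurrent propagation described in the abstract, where the output tokens for one frame serve as the object queries for the next, can be illustrated with a small sketch. This is not the authors' implementation: the single-head cross-attention in `decode_frame` is a toy stand-in for the transformer decoder, and the names `decode_frame` and `track_video`, the tensor shapes, and the random initial queries are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_frame(queries, frame_tokens):
    """Toy single-head cross-attention (hypothetical stand-in for the decoder).

    queries:      (num_objects, dim) object tokens carried over from the previous frame
    frame_tokens: (num_tokens, dim) image tokens of the current frame
    """
    dim = queries.shape[-1]
    attn = softmax(queries @ frame_tokens.T / np.sqrt(dim))
    # Residual update: each object token reads from the current frame.
    return queries + attn @ frame_tokens

def track_video(video_tokens, init_queries):
    """Outputs at frame t become the object queries at frame t + 1."""
    queries = init_queries
    per_frame_outputs = []
    for frame_tokens in video_tokens:
        queries = decode_frame(queries, frame_tokens)
        per_frame_outputs.append(queries)
    return np.stack(per_frame_outputs)

# Assumed toy sizes: 4 frames, 16 image tokens per frame, 3 object slots, dim 8.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(4, 16, 8))
init_queries = rng.normal(size=(3, 8))
outputs = track_video(video_tokens, init_queries)
print(outputs.shape)
```

Because each object slot keeps its index from frame to frame, the slot index can act as a track identity in this sketch. A tracking-by-detection pipeline, by contrast, would have to associate independent per-frame detections in a separate step.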

Cite

Text

Heigold et al. "Video OWL-ViT: Temporally-Consistent Open-World Localization in Video." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01269

Markdown

[Heigold et al. "Video OWL-ViT: Temporally-Consistent Open-World Localization in Video." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/heigold2023iccv-video/) doi:10.1109/ICCV51070.2023.01269

BibTeX

@inproceedings{heigold2023iccv-video,
  title     = {{Video OWL-ViT: Temporally-Consistent Open-World Localization in Video}},
  author    = {Heigold, Georg and Minderer, Matthias and Gritsenko, Alexey and Bewley, Alex and Keysers, Daniel and Lučić, Mario and Yu, Fisher and Kipf, Thomas},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13802--13811},
  doi       = {10.1109/ICCV51070.2023.01269},
  url       = {https://mlanthology.org/iccv/2023/heigold2023iccv-video/}
}