Leveraging Task-Specific Pre-Training to Reason Across Images and Videos

Abstract

We explore the Reasoning Across Images and Videos (RAIV) task, which requires models to reason over a pair of visual inputs comprising various combinations of images and/or videos. Previous work in this area has been limited to image pairs and focuses primarily on the existence and/or cardinality of objects. To address these limitations, we leverage existing datasets with rich annotations to generate semantically meaningful queries about actions, objects, and their relationships. We introduce new datasets that encompass visually similar inputs and require reasoning across image pairs, across images and videos, or across video pairs. Recognizing that RAIV differs from existing pre-training objectives, which operate on single image-text pairs, we explore task-specific pre-training, wherein a pre-trained model is further trained on an objective similar to the downstream tasks without utilizing the fine-tuning datasets. Experiments with several state-of-the-art pre-trained image-language models reveal that task-specific pre-training significantly enhances performance on downstream datasets, even in the absence of additional pre-training data. We provide further ablative studies to guide future work.
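For intuition, the sketch below illustrates the task-specific pre-training recipe at a high level: a lightweight head over pre-trained visual and text features is trained on automatically generated paired-input queries before any fine-tuning data is used. This is a minimal PyTorch-style sketch under assumed inputs, not the authors' implementation; PairReasoner, task_specific_pretrain, and the feature loader are hypothetical names.

# Minimal sketch of task-specific pre-training for paired visual inputs.
# Hypothetical placeholder code; not the paper's actual model or objective.
import torch
import torch.nn as nn

class PairReasoner(nn.Module):
    """Scores a text query against a pair of visual inputs (image or video features)."""
    def __init__(self, feat_dim=512, text_dim=512, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # binary: does the query hold for this pair?
        )

    def forward(self, feats_a, feats_b, query_emb):
        # Concatenate both visual inputs with the query embedding and score.
        return self.fuse(torch.cat([feats_a, feats_b, query_emb], dim=-1)).squeeze(-1)

def task_specific_pretrain(model, loader, epochs=1, lr=1e-4):
    """Train on RAIV-style queries generated from existing annotations,
    before and without any downstream fine-tuning data."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for feats_a, feats_b, query_emb, label in loader:
            logits = model(feats_a, feats_b, query_emb)
            loss = loss_fn(logits, label.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

The key design point the sketch tries to convey is that the pre-training objective mirrors the downstream task (reasoning over a pair of inputs given a query) rather than the single image-text objectives used in standard vision-language pre-training.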

Cite

Text

Sadhu and Nevatia. "Leveraging Task-Specific Pre-Training to Reason Across Images and Videos." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Sadhu and Nevatia. "Leveraging Task-Specific Pre-Training to Reason Across Images and Videos." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/sadhu2024wacv-leveraging/)

BibTeX

@inproceedings{sadhu2024wacv-leveraging,
  title     = {{Leveraging Task-Specific Pre-Training to Reason Across Images and Videos}},
  author    = {Sadhu, Arka and Nevatia, Ram},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {5794--5804},
  url       = {https://mlanthology.org/wacv/2024/sadhu2024wacv-leveraging/}
}