Prompting Visual-Language Models for Efficient Video Understanding

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie

ECCV 2022

doi:10.1007/978-3-031-19833-5_7

Abstract

Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline for efficiently adapting a pre-trained I-VL model to video understanding tasks with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On ten public benchmarks covering action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance relative to existing methods, despite optimising significantly fewer parameters. Due to space limitations, we refer readers to the arXiv version at https://arxiv.org/abs/2112.04478.
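The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling stands in for both the frozen text encoder and the temporal Transformer, no training is performed, and all names, dimensions, and toy feature values (`make_prompt_vectors`, `build_text_input`, `DIM`, the class names) are hypothetical choices made for illustration only.

```python
import math
import random


def make_prompt_vectors(count, dim, rng):
    # Randomly initialised vectors. In the setting the abstract describes,
    # these would be the only text-side parameters that get optimised.
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def build_text_input(prefix, class_tokens, suffix):
    # [prompt vectors] + [class-name token embeddings] + [prompt vectors]
    return prefix + class_tokens + suffix


def mean_pool(vectors):
    # ASSUMPTION: a stand-in for a real encoder (frozen text encoder on the
    # text side, lightweight temporal Transformer on the video side).
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(frame_features, text_inputs):
    # Aggregate per-frame features into one video embedding, then pick the
    # class whose prompted text embedding is most similar.
    video = mean_pool(frame_features)
    scores = {name: cosine(video, mean_pool(tokens))
              for name, tokens in text_inputs.items()}
    return max(scores, key=scores.get), scores


rng = random.Random(0)
DIM = 4  # hypothetical embedding size, kept tiny for readability

prefix = make_prompt_vectors(2, DIM, rng)
suffix = make_prompt_vectors(2, DIM, rng)

# Hypothetical frozen token embeddings for two action class names.
class_tokens = {
    "archery": [[1.0, 0.0, 0.0, 0.0]],
    "bowling": [[0.0, 1.0, 0.0, 0.0]],
}
text_inputs = {name: build_text_input(prefix, tokens, suffix)
               for name, tokens in class_tokens.items()}

# Hypothetical frame-wise visual features for a three-frame video.
frames = [
    [0.9, 0.1, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.8, 0.2, 0.0, 0.1],
]

label, scores = classify(frames, text_inputs)
print(label)
```

In a real system the two `mean_pool` calls would be replaced by the pre-trained encoders, and only the prompt vectors and the temporal module would receive gradients; the sketch only shows how the pieces connect.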


Cite

Text

Ju et al. "Prompting Visual-Language Models for Efficient Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19833-5_7

Markdown

[Ju et al. "Prompting Visual-Language Models for Efficient Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/ju2022eccv-prompting/) doi:10.1007/978-3-031-19833-5_7

BibTeX

@inproceedings{ju2022eccv-prompting,
  title     = {{Prompting Visual-Language Models for Efficient Video Understanding}},
  author    = {Ju, Chen and Han, Tengda and Zheng, Kunhao and Zhang, Ya and Xie, Weidi},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19833-5_7},
  url       = {https://mlanthology.org/eccv/2022/ju2022eccv-prompting/}
}