Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
Abstract
Adopting contrastive image-text pretrained models like CLIP for video classification has gained attention due to their cost-effectiveness and competitive performance. However, recent works in this area face a trade-off: finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization, while freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances the supervised and zero-shot performance under a single unified training. On the vision side, our prompting approach caters to three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize far fewer parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models will be publicly released.
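The vision-side scheme in the abstract can be pictured as prepending learnable tokens to a frozen CLIP vision encoder. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation, and all names and shapes (VisionPromptedCLIP, embed_dim, num_global, and so on) are illustrative assumptions.

```python
# Minimal sketch of the vision-side prompting described above, assuming a
# PyTorch-style frozen CLIP vision encoder. Names and shapes are illustrative,
# not taken from the paper's code.
import torch
import torch.nn as nn


class VisionPromptedCLIP(nn.Module):
    def __init__(self, clip_visual: nn.Module, embed_dim: int = 768,
                 num_frames: int = 8, num_global: int = 8):
        super().__init__()
        self.clip_visual = clip_visual
        # Freeze the pretrained backbone; only the prompts below are trained.
        for p in self.clip_visual.parameters():
            p.requires_grad = False
        # Global video-level prompts: shared across the whole clip.
        self.global_prompts = nn.Parameter(0.02 * torch.randn(num_global, embed_dim))
        # Local frame-level prompts: one per frame, for per-frame conditioning.
        self.local_prompts = nn.Parameter(0.02 * torch.randn(num_frames, embed_dim))
        # Summary prompt: a single token meant to distill a condensed video representation.
        self.summary_prompt = nn.Parameter(0.02 * torch.randn(1, embed_dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_tokens, embed_dim) patch embeddings of all frames.
        b = frame_tokens.shape[0]
        prompts = torch.cat(
            [self.summary_prompt, self.global_prompts, self.local_prompts], dim=0
        ).unsqueeze(0).expand(b, -1, -1)
        # Prepend the learnable prompts so the frozen attention layers can attend to them.
        return torch.cat([prompts, frame_tokens], dim=1)
```

Under this sketch, the concatenated token sequence would then pass through the frozen transformer blocks, and the output at the summary-prompt position would serve as the video embedding matched against the prompted text features.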
Cite
Text
Wasim et al. "Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02206
Markdown
[Wasim et al. "Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/wasim2023cvpr-vitaclip/) doi:10.1109/CVPR52729.2023.02206
BibTeX
@inproceedings{wasim2023cvpr-vitaclip,
title = {{Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting}},
author = {Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Khan, Fahad Shahbaz and Shah, Mubarak},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {23034--23044},
doi = {10.1109/CVPR52729.2023.02206},
url = {https://mlanthology.org/cvpr/2023/wasim2023cvpr-vitaclip/}
}