MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Abstract
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code is provided in supplementary and will be released upon acceptance.
Cite
Text
Lin et al. "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00267Markdown
[Lin et al. "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/lin2023iccv-match/) doi:10.1109/ICCV51070.2023.00267BibTeX
@inproceedings{lin2023iccv-match,
title = {{MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge}},
author = {Lin, Wei and Karlinsky, Leonid and Shvetsova, Nina and Possegger, Horst and Kozinski, Mateusz and Panda, Rameswar and Feris, Rogerio and Kuehne, Hilde and Bischof, Horst},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {2851-2862},
doi = {10.1109/ICCV51070.2023.00267},
url = {https://mlanthology.org/iccv/2023/lin2023iccv-match/}
}