ReGen: A Good Generative Zero-Shot Video Classifier Should Be Rewarded

Abstract

This paper sets out to solve the following problem: how can we turn a generative video captioning model into an open-world video/action classification model? Video captioning models naturally produce open-ended, free-form descriptions of a given video; these descriptions, however, might not be discriminative enough for video/action recognition. Unfortunately, when fine-tuned to auto-regress the class names directly, video captioning models overfit to the base classes and lose their open-world zero-shot capabilities. To alleviate base-class overfitting, we propose to use reinforcement learning to make the output of the video captioning model more class-level discriminative. Specifically, we propose ReGen, a novel reinforcement-learning-based framework with a three-fold objective, expressed as three reward functions: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into the corresponding action class, (2) a CLIP reward that encourages the generated caption to remain descriptive of the input video (i.e., video-specific), and (3) a grammar reward that preserves the grammatical correctness of the caption. We show that ReGen can train a model to produce captions that are discriminative, video-specific, and grammatically correct. Importantly, when evaluated on standard benchmarks for zero- and few-shot action classification, ReGen significantly outperforms the previous state-of-the-art.
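As a rough illustration of how three reward terms like those described above could be combined in a policy-gradient update, the sketch below sums the rewards with weights and applies a REINFORCE-style loss with a mean baseline. This is not taken from the paper: the weighted sum, the mean baseline, and all names (`RewardWeights`, `combined_reward`, `reinforce_loss`) are assumptions made for illustration, and the individual reward values are assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for the three reward terms (assumed, not from the paper)."""
    discrimination: float = 1.0
    clip: float = 1.0
    grammar: float = 1.0


def combined_reward(r_cls: float, r_clip: float, r_gram: float,
                    weights: Optional[RewardWeights] = None) -> float:
    """Weighted sum of the class-discrimination, CLIP, and grammar rewards.

    A weighted sum is only one plausible way to combine the terms.
    """
    w = weights or RewardWeights()
    return w.discrimination * r_cls + w.clip * r_clip + w.grammar * r_gram


def reinforce_loss(log_probs: Sequence[float], rewards: Sequence[float]) -> float:
    """REINFORCE-style loss over a batch of sampled captions.

    log_probs[i] is the summed log-probability of the i-th sampled caption,
    rewards[i] its scalar reward. The batch mean is used as a baseline
    (an assumption) to reduce the variance of the gradient estimate.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and equally long")
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)


# Usage: two sampled captions with made-up reward values.
rewards = [
    combined_reward(1.0, 0.5, 0.5),   # caption judged correct, descriptive, grammatical
    combined_reward(0.0, 0.5, 0.5),   # caption assigned to the wrong class
]
loss = reinforce_loss(log_probs=[-1.0, -2.0], rewards=rewards)
```

Minimizing this loss raises the log-probability of captions whose reward is above the batch average and lowers it for those below. In a real training loop the log-probabilities would be differentiable tensors produced by the captioning model rather than plain floats.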

Cite

Text

Bulat et al. "ReGen: A Good Generative Zero-Shot Video Classifier Should Be Rewarded." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01244

Markdown

[Bulat et al. "ReGen: A Good Generative Zero-Shot Video Classifier Should Be Rewarded." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/bulat2023iccv-regen/) doi:10.1109/ICCV51070.2023.01244

BibTeX

@inproceedings{bulat2023iccv-regen,
  title     = {{ReGen: A Good Generative Zero-Shot Video Classifier Should Be Rewarded}},
  author    = {Bulat, Adrian and Sanchez, Enrique and Martinez, Brais and Tzimiropoulos, Georgios},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13523-13533},
  doi       = {10.1109/ICCV51070.2023.01244},
  url       = {https://mlanthology.org/iccv/2023/bulat2023iccv-regen/}
}