Zero-Shot Action Recognition with Transformer-Based Video Semantic Embedding

Abstract

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that captures long-range spatiotemporal dependencies efficiently, in contrast to existing approaches based on 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimental setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51, and ActivityNet datasets.
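
As a rough illustration of the zero-shot recipe the abstract describes (and not the authors' implementation), the sketch below shows one way a transformer encoder could pool per-frame features into a video-level embedding that is matched against semantic embeddings of unseen class names by cosine similarity. All module names, dimensions, and hyperparameters here are assumptions for illustration only.

# Minimal sketch, assuming a generic PyTorch setup: a transformer encoder
# pools per-frame features into a video embedding, which is compared with
# semantic embeddings of *unseen* class names (disjoint from the training
# classes) via cosine similarity. Not the paper's architecture or code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoSemanticEmbedder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=300, num_layers=4, num_heads=6):
        super().__init__()
        # Project frame features into the semantic (word-vector) space.
        self.proj = nn.Linear(feat_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # frame_feats: (B, T, feat_dim)
        x = self.encoder(self.proj(frame_feats))
        return x.mean(dim=1)                   # (B, embed_dim) video embedding

def zero_shot_predict(model, frame_feats, class_word_vecs):
    # class_word_vecs: (C, embed_dim) semantic embeddings of unseen class names.
    video_emb = model(frame_feats)                                     # (B, D)
    sims = F.cosine_similarity(video_emb.unsqueeze(1),
                               class_word_vecs.unsqueeze(0), dim=-1)   # (B, C)
    return sims.argmax(dim=-1)                 # index of the closest class name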

Cite

Text

Doshi and Yilmaz. "Zero-Shot Action Recognition with Transformer-Based Video Semantic Embedding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00514

Markdown

[Doshi and Yilmaz. "Zero-Shot Action Recognition with Transformer-Based Video Semantic Embedding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/doshi2023cvprw-zeroshot/) doi:10.1109/CVPRW59228.2023.00514

BibTeX

@inproceedings{doshi2023cvprw-zeroshot,
  title     = {{Zero-Shot Action Recognition with Transformer-Based Video Semantic Embedding}},
  author    = {Doshi, Keval and Yilmaz, Yasin},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {4859--4868},
  doi       = {10.1109/CVPRW59228.2023.00514},
  url       = {https://mlanthology.org/cvprw/2023/doshi2023cvprw-zeroshot/}
}