Few-Shot Transformation of Common Actions into Time and Space

Abstract

This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on our reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, it also compares favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel.
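
To make the proposal-free encoder-decoder described above more concrete, the following is a minimal sketch, assuming PyTorch and a DETR-style set of learned localization queries. It is not the authors' implementation; every module name, dimension, and prediction head below is an illustrative assumption.

# Conceptual sketch (not the authors' code) of a proposal-free few-shot
# transformer for common action localization. Assumes PyTorch; all names,
# sizes, and heads are hypothetical.
import torch
import torch.nn as nn


class FewShotCommonLocalizer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_queries=10):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=3, num_decoder_layers=3, batch_first=True)
        # Learned localization queries, in the spirit of DETR-style decoding,
        # replace explicit spatio-temporal proposals.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)    # spatial box per query
        self.time_head = nn.Linear(d_model, 2)   # temporal interval per query
        self.score_head = nn.Linear(d_model, 1)  # commonality confidence

    def forward(self, query_feats, support_feats):
        # query_feats:   (B, Tq, d_model) features of the untrimmed query video
        # support_feats: (B, Ts, d_model) pooled features of the few trimmed supports
        # Encode query and support tokens jointly so the model can learn what
        # the supports have in common and where it occurs in the query video.
        tokens = torch.cat([query_feats, support_feats], dim=1)
        memory = self.transformer.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(query_feats.size(0), -1, -1)
        decoded = self.transformer.decoder(q, memory)
        return {
            "boxes": self.box_head(decoded).sigmoid(),       # normalized (cx, cy, w, h)
            "intervals": self.time_head(decoded).sigmoid(),  # normalized (start, end)
            "scores": self.score_head(decoded),              # localization confidence
        }

In this reading, the encoder handles commonality learning across support and query features, while the decoder's queries directly regress boxes and intervals, which is one plausible way to realize "without the need for proposals".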

Cite

Text

Yang et al. "Few-Shot Transformation of Common Actions into Time and Space." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01577

Markdown

[Yang et al. "Few-Shot Transformation of Common Actions into Time and Space." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/yang2021cvpr-fewshot/) doi:10.1109/CVPR46437.2021.01577

BibTeX

@inproceedings{yang2021cvpr-fewshot,
  title     = {{Few-Shot Transformation of Common Actions into Time and Space}},
  author    = {Yang, Pengwan and Mettes, Pascal and Snoek, Cees G. M.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {16031--16040},
  doi       = {10.1109/CVPR46437.2021.01577},
  url       = {https://mlanthology.org/cvpr/2021/yang2021cvpr-fewshot/}
}