Listen to Look: Action Recognition by Previewing Audio

Abstract

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves state-of-the-art results in terms of both recognition accuracy and speed.
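
The distillation idea in the abstract can be illustrated with a toy sketch: a lightweight "student" maps a single-frame feature and an audio feature to an approximation of the clip-level feature produced by an expensive "teacher". Everything below is an assumption made for illustration and is not taken from the paper: the function names `fuse` and `distill_loss`, the linear fusion, and the use of mean squared error as the feature-matching loss are all hypothetical stand-ins.

```python
from typing import List


def fuse(image_feat: List[float], audio_feat: List[float],
         weights: List[List[float]]) -> List[float]:
    """Hypothetical student: a linear map over concatenated image and audio features."""
    x = image_feat + audio_feat  # list concatenation, not element-wise addition
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must match the concatenated feature length")
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def distill_loss(student: List[float], teacher: List[float]) -> float:
    """Mean squared error between student and teacher features (illustrative loss)."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


# Toy example: a 2-d image feature, a 1-d audio feature, and a 2-d teacher clip feature.
student_feat = fuse([1.0, 2.0], [3.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
loss = distill_loss(student_feat, [1.0, 3.0])
```

In a real system the student would be a trained network and the loss would be minimized over many clips; the sketch only shows the shape of the objective, where the cheap modalities are trained to reproduce the expensive clip-level feature.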

Cite

Text

Gao et al. "Listen to Look: Action Recognition by Previewing Audio." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01047

Markdown

[Gao et al. "Listen to Look: Action Recognition by Previewing Audio." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/gao2020cvpr-listen/) doi:10.1109/CVPR42600.2020.01047

BibTeX

@inproceedings{gao2020cvpr-listen,
  title     = {{Listen to Look: Action Recognition by Previewing Audio}},
  author    = {Gao, Ruohan and Oh, Tae-Hyun and Grauman, Kristen and Torresani, Lorenzo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01047},
  url       = {https://mlanthology.org/cvpr/2020/gao2020cvpr-listen/}
}