Lightweight Action Recognition in Compressed Videos

Abstract

Most existing action recognition models are large convolutional neural networks that accept only raw RGB frames as input. However, practical applications require lightweight models that directly process compressed videos. In this work, such a model is developed for the first time: it is lightweight enough to run in real time on embedded AI devices without sacrificing recognition accuracy. A new Aligned Temporal Trilinear Pooling (ATTP) module is formulated to fuse the three modalities in a compressed video. To remedy the weakness of motion vectors (compared to optical flow computed from raw RGB streams) for representing dynamic content, we introduce a temporal fusion method that explicitly induces temporal context, as well as knowledge distillation via feature alignment from a model trained with optical flow. Compared to existing compressed video action recognition models, ours is much more compact and faster thanks to its lightweight CNN backbone.
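To make the trilinear fusion idea concrete, here is a minimal, hedged sketch of fusing three compressed-video modality features (I-frame, motion vector, residual) by an element-wise trilinear product after linear projection, a common low-rank approximation of the full trilinear outer product. The function names, the projection step, and the feature shapes are illustrative assumptions, not the paper's actual ATTP implementation, which also performs temporal alignment.

```python
# Illustrative sketch only: not the paper's ATTP module.
# Assumes each modality is already encoded as a flat feature vector.

def project(x, w):
    # Linear projection y = W x, with W given as a list of rows.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def fuse_trilinear(f_iframe, f_mv, f_residual):
    # Compact trilinear fusion: element-wise product of the three
    # (equal-length) modality features. Each output element is the
    # product of the corresponding elements of the three inputs.
    return [a * b * c for a, b, c in zip(f_iframe, f_mv, f_residual)]

# Tiny usage example with hypothetical 2-D features:
fused = fuse_trilinear([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
# fused == [15.0, 48.0]
```

In practice each modality would first pass through its own CNN backbone (a lightweight one, per the paper) before fusion, and the fused vector would feed a classifier head.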

Cite

Text

Huo et al. "Lightweight Action Recognition in Compressed Videos." European Conference on Computer Vision Workshops, 2020. doi:10.1007/978-3-030-66096-3_24

Markdown

[Huo et al. "Lightweight Action Recognition in Compressed Videos." European Conference on Computer Vision Workshops, 2020.](https://mlanthology.org/eccvw/2020/huo2020eccvw-lightweight/) doi:10.1007/978-3-030-66096-3_24

BibTeX

@inproceedings{huo2020eccvw-lightweight,
  title     = {{Lightweight Action Recognition in Compressed Videos}},
  author    = {Huo, Yuqi and Xu, Xiaoli and Lu, Yao and Niu, Yulei and Ding, Mingyu and Lu, Zhiwu and Xiang, Tao and Wen, Ji-Rong},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2020},
  pages     = {337--352},
  doi       = {10.1007/978-3-030-66096-3_24},
  url       = {https://mlanthology.org/eccvw/2020/huo2020eccvw-lightweight/}
}