Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation

Abstract

In this paper, we propose a deeply-supervised CNN model for action recognition that fully exploits the powerful hierarchical features of CNNs. The model builds multi-level video representations by applying our proposed aggregation module at different convolutional layers. Moreover, we train the model with deep supervision, which improves both performance and efficiency. In addition, to capture the temporal structure of actions while preserving more of their details, we propose a trainable aggregation module that models the temporal evolution of each spatial location and projects it into a semantic space using the Vector of Locally Aggregated Descriptors (VLAD) technique. The deeply-supervised CNN model, integrated with this aggregation module, provides a promising solution for recognizing actions in videos. We conduct experiments on two action recognition datasets, HMDB51 and UCF101, and the results show that our model outperforms state-of-the-art methods.
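
To make the two key ideas concrete, below is a minimal sketch (not the authors' released code) of a NetVLAD-style trainable aggregation layer and a deeply-supervised loss, assuming PyTorch. Names such as TrainableVLAD, num_clusters, and aux_weight are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableVLAD(nn.Module):
    # Soft-assignment VLAD pooling over the temporal axis, in the spirit of
    # NetVLAD; it aggregates the temporal evolution of per-location features.
    def __init__(self, feat_dim, num_clusters):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)  # learned soft assignment

    def forward(self, x):
        # x: (batch, time, feat_dim), features of one spatial location over time
        a = F.softmax(self.assign(x), dim=-1)            # (B, T, K) assignments
        residuals = x.unsqueeze(2) - self.centroids      # (B, T, K, D) residuals
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)  # (B, K, D) aggregation
        vlad = F.normalize(vlad, dim=-1)                 # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)      # (B, K*D) descriptor

def deeply_supervised_loss(logits_per_level, target, aux_weight=0.3):
    # The final level carries the main loss; earlier levels receive
    # down-weighted auxiliary losses, the usual form of deep supervision.
    loss = F.cross_entropy(logits_per_level[-1], target)
    for logits in logits_per_level[:-1]:
        loss = loss + aux_weight * F.cross_entropy(logits, target)
    return loss

In the paper's setting, one such aggregation module would be attached at each chosen convolutional layer, and the per-level classifier outputs would be combined through a deeply-supervised loss of this form.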

Cite

Text

Li et al. "Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/112

Markdown

[Li et al. "Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/li2018ijcai-deeply/) doi:10.24963/IJCAI.2018/112

BibTeX

@inproceedings{li2018ijcai-deeply,
  title     = {{Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation}},
  author    = {Li, Yang and Li, Kan and Wang, Xinxin},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {807--813},
  doi       = {10.24963/IJCAI.2018/112},
  url       = {https://mlanthology.org/ijcai/2018/li2018ijcai-deeply/}
}