Multi-Modal Pyramid Feature Combination for Human Action Recognition

Abstract

Accurate human action recognition remains a challenging task in computer vision. While many approaches focus on narrow image features, this work proposes a novel multi-modal method that combines task-specific features (action recognition, scene understanding, object detection, and acoustic event detection) for human action recognition. This work makes two contributions: 1) the introduction of a feature fusion block that uses a gating mechanism to perform attention over features from other domains, and 2) a pyramidal feature combination approach that hierarchically combines pairs of features from different tasks using the preceding fusion block. The richer features generated by the pyramid are used for human action recognition. This approach is validated on a subset of the Moments in Time dataset, achieving an accuracy of 35.43%.
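The two contributions above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the exact gate formulation, weight shapes, and the residual-style combination below are assumptions, and the `gated_fusion` / `pyramid_fuse` names and the shared gate weights `W` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(a, b, W):
    """Assumed fusion block: a per-dimension gate computed from both
    modalities attends over the cross-domain feature `b`."""
    gate = sigmoid(W @ np.concatenate([a, b]))  # gate values in (0, 1)
    return a + gate * b                          # gated combination

def pyramid_fuse(features, W):
    """Hierarchically fuse adjacent pairs of task features, level by
    level, until a single enriched feature vector remains."""
    level = list(features)
    while len(level) > 1:
        nxt = [gated_fusion(level[i], level[i + 1], W)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # carry an unpaired feature upward
            nxt.append(level[-1])
        level = nxt
    return level[0]

rng = np.random.default_rng(0)
d = 8
# stand-ins for the four task-specific features: action, scene,
# object, and acoustic-event embeddings (dimensions are illustrative)
feats = [rng.standard_normal(d) for _ in range(4)]
W = rng.standard_normal((d, 2 * d)) * 0.1  # illustrative gate weights
fused = pyramid_fuse(feats, W)
print(fused.shape)  # (8,)
```

With four input tasks the pyramid has two levels: two pairwise fusions, then one fusion of the results, yielding a single feature vector for the action classifier.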

Cite

Text

Roig et al. "Multi-Modal Pyramid Feature Combination for Human Action Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00465

Markdown

[Roig et al. "Multi-Modal Pyramid Feature Combination for Human Action Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/roig2019iccvw-multimodal/) doi:10.1109/ICCVW.2019.00465

BibTeX

@inproceedings{roig2019iccvw-multimodal,
  title     = {{Multi-Modal Pyramid Feature Combination for Human Action Recognition}},
  author    = {Roig, Carlos and Sarmiento, Manuel and Varas, David and Masuda, Issey and Riveiro, Juan Carlos and Bou-Balust, Elisenda},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2019},
  pages     = {3742--3746},
  doi       = {10.1109/ICCVW.2019.00465},
  url       = {https://mlanthology.org/iccvw/2019/roig2019iccvw-multimodal/}
}