Multi-Modal Pyramid Feature Combination for Human Action Recognition
Abstract
Accurate human action recognition remains a challenging task in computer vision. While many approaches focus on narrow image features, this work proposes a novel multi-modal method that combines task-specific features (action recognition, scene understanding, object detection and acoustic event detection) for human action recognition. This work makes two contributions: 1) the introduction of a feature fusion block that uses a gating mechanism to perform attention over features from other domains, and 2) a pyramidal feature combination approach that hierarchically combines pairs of features from different tasks using the aforementioned fusion block. The richer features generated by the pyramid are used for human action recognition. This approach is validated on a subset of the Moments in Time dataset, resulting in an accuracy of 35.43%.
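The abstract describes two components: a gated fusion block that attends over features from another domain, and a pyramid that hierarchically fuses pairs of task-specific features. The following is a minimal NumPy sketch of that idea under stated assumptions; the function names (`gated_fusion`, `pyramid`), the weight matrices, the per-dimension sigmoid gate, and the adjacent-pair reduction scheme are all illustrative choices, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # assumed common feature dimension for all tasks

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_fusion(x, y, Wg, Wf):
    """Fuse feature x with partner feature y via a learned gate.

    The gate is computed from the concatenated pair, so each output
    dimension decides how much of x to keep versus the projected mix.
    (Illustrative formulation, not the paper's exact block.)
    """
    z = np.concatenate([x, y])          # (2d,)
    g = sigmoid(z @ Wg)                 # (d,) per-dimension gate in (0, 1)
    return g * x + (1.0 - g) * np.tanh(z @ Wf)

# Illustrative weights shared across the pyramid for brevity.
Wg = rng.standard_normal((2 * d, d)) * 0.1
Wf = rng.standard_normal((2 * d, d)) * 0.1

def pyramid(features):
    """Hierarchically fuse adjacent pairs until one feature remains:
    4 task features -> 3 -> 2 -> 1 (one assumed reduction scheme)."""
    level = features
    while len(level) > 1:
        level = [gated_fusion(level[i], level[i + 1], Wg, Wf)
                 for i in range(len(level) - 1)]
    return level[0]

# Four task-specific features: action, scene, object, audio (random stand-ins).
feats = [rng.standard_normal(d) for _ in range(4)]
fused = pyramid(feats)
print(fused.shape)  # the fused feature keeps the common dimension d
```

The fused vector would then feed a classification head for the action labels; in the paper the pyramid output is what drives the final recognition accuracy reported above.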
Cite
Text
Roig et al. "Multi-Modal Pyramid Feature Combination for Human Action Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00465
Markdown
[Roig et al. "Multi-Modal Pyramid Feature Combination for Human Action Recognition." IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/roig2019iccvw-multimodal/) doi:10.1109/ICCVW.2019.00465
BibTeX
@inproceedings{roig2019iccvw-multimodal,
title = {{Multi-Modal Pyramid Feature Combination for Human Action Recognition}},
author = {Roig, Carlos and Sarmiento, Manuel and Varas, David and Masuda, Issey and Riveiro, Juan Carlos and Bou-Balust, Elisenda},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2019},
pages = {3742-3746},
doi = {10.1109/ICCVW.2019.00465},
url = {https://mlanthology.org/iccvw/2019/roig2019iccvw-multimodal/}
}