Residual Attention-Based Fusion for Video Classification

Abstract

Video data is inherently multimodal and sequential. Deep learning models therefore need to aggregate all data modalities while capturing the most relevant spatio-temporal information in a given video. This paper presents a multimodal deep learning framework for video classification using a Residual Attention-based Fusion (RAF) method. Specifically, the framework extracts spatio-temporal features from each modality using a residual attention-based bidirectional Long Short-Term Memory (LSTM) network and fuses the information using a weighted Support Vector Machine (SVM) to handle the imbalanced data. Experimental results on a natural disaster video dataset show that our approach improves upon the state of the art by 5% in F1-score and 8% in mean average precision (MAP). Most notably, our proposed residual attention model reaches a 0.95 F1-score and 0.92 MAP on this dataset.
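The abstract names two ingredients, attention over recurrent hidden states with a residual connection and class weighting for imbalanced data, without giving their equations. The following is a minimal illustrative sketch of those ideas, not the authors' implementation. It assumes dot-product attention scored by a hypothetical learned `query` vector, a residual path that adds the unweighted mean of the hidden states, and the common inverse-frequency heuristic for class weights; all function and parameter names are invented for illustration.

```python
import math
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention_pool(hidden_states, query):
    """Pool T hidden states (each of length D) into a single vector.

    hidden_states: list of T vectors, e.g. per-frame BiLSTM outputs.
    query: hypothetical learned scoring vector of length D (an assumption).
    Returns (pooled_vector, attention_weights).
    """
    # Score each time step by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)

    dim = len(query)
    steps = len(hidden_states)
    # Attention-weighted sum of the hidden states.
    attended = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    # Unweighted mean of the hidden states, used here as the skip path.
    mean = [sum(h[d] for h in hidden_states) / steps for d in range(dim)]
    # Residual connection (assumed form): attended summary plus the skip path.
    pooled = [a + m for a, m in zip(attended, mean)]
    return pooled, weights


def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = residual_attention_pool(states, query=[0.5, -0.5])
    print(pooled, weights)
    print(balanced_class_weights(["flood", "flood", "flood", "fire"]))
```

In a full pipeline, one pooled vector per modality would typically be concatenated and passed to a classifier whose per-class penalty is scaled by weights like these, so that rare classes are not ignored during training.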

Cite

Text

Pouyanfar et al. "Residual Attention-Based Fusion for Video Classification." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019. doi:10.1109/CVPRW.2019.00064

Markdown

[Pouyanfar et al. "Residual Attention-Based Fusion for Video Classification." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.](https://mlanthology.org/cvprw/2019/pouyanfar2019cvprw-residual/) doi:10.1109/CVPRW.2019.00064

BibTeX

@inproceedings{pouyanfar2019cvprw-residual,
  title     = {{Residual Attention-Based Fusion for Video Classification}},
  author    = {Pouyanfar, Samira and Wang, Tianyi and Chen, Shu-Ching},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2019},
  pages     = {478--480},
  doi       = {10.1109/CVPRW.2019.00064},
  url       = {https://mlanthology.org/cvprw/2019/pouyanfar2019cvprw-residual/}
}