Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description

Abstract

We incorporate audio features, in addition to image and motion features, for video description based on encoder-decoder recurrent neural networks (RNNs). To fuse these modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. We apply our new framework to video description using state-of-the-art audio features such as SoundNet and AudioSet VGGish, together with state-of-the-art image and spatiotemporal features such as I3D. Results confirm that our attention-based multimodal fusion of audio and visual features outperforms conventional video description approaches on three datasets.
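
To make the modality-level attention concrete, below is a minimal sketch of the fusion step the abstract describes: given one context vector per modality (e.g., image, motion, audio, each assumed to come from its own temporal attention over encoder outputs) and the current decoder state, score each modality, softmax across modalities, and fuse the projected contexts into one vector for predicting the next word. All class and variable names, layer choices, and dimensions here (e.g., 1024-d I3D contexts, 128-d VGGish contexts) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    """Sketch of modality-level attention for fusing per-modality
    context vectors, conditioned on the decoder hidden state."""

    def __init__(self, ctx_dims, hidden_dim, fused_dim):
        super().__init__()
        # One projection per modality so contexts share a common space.
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in ctx_dims)
        # Score each projected context against the decoder state.
        self.w_state = nn.Linear(hidden_dim, fused_dim)
        self.v = nn.Linear(fused_dim, 1)

    def forward(self, contexts, dec_state):
        # contexts: list of (batch, ctx_dim_k) tensors, one per modality
        # dec_state: (batch, hidden_dim) decoder hidden state
        projected = [p(c) for p, c in zip(self.proj, contexts)]
        s = self.w_state(dec_state)
        scores = torch.stack(
            [self.v(torch.tanh(p + s)) for p in projected], dim=1
        )  # (batch, n_modalities, 1)
        beta = torch.softmax(scores, dim=1)   # modality attention weights
        fused = (beta * torch.stack(projected, dim=1)).sum(dim=1)
        return fused, beta.squeeze(-1)

# Example: fuse image (1024-d), motion (1024-d), and audio (128-d) contexts.
att = MultimodalAttention(ctx_dims=[1024, 1024, 128],
                          hidden_dim=512, fused_dim=256)
ctxs = [torch.randn(2, 1024), torch.randn(2, 1024), torch.randn(2, 128)]
fused, beta = att(ctxs, torch.randn(2, 512))
print(fused.shape, beta.shape)  # torch.Size([2, 256]) torch.Size([2, 3])
```

Because the weights `beta` are recomputed at every decoding step, the model can, for example, lean on the audio context when generating a word like "music" and on the spatiotemporal context for an action word, which is the selective behavior the abstract claims.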

Cite

Text

Hori et al. "Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.

Markdown

[Hori et al. "Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/hori2018cvprw-multimodal/)

BibTeX

@inproceedings{hori2018cvprw-multimodal,
  title     = {{Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description}},
  author    = {Hori, Chiori and Hori, Takaaki and Wichern, Gordon and Wang, Jue and Lee, Teng-Yok and Cherian, Anoop and Marks, Tim K.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2018},
  pages     = {2528-2531},
  url       = {https://mlanthology.org/cvprw/2018/hori2018cvprw-multimodal/}
}