MAM-RNN: Multi-Level Attention Model Based RNN for Video Captioning

Abstract

Visual information is crucial for the task of video captioning. However, videos contain a lot of content that is uncorrelated with the caption, which may interfere with generating a correct description. Motivated by this, we attempt to exploit the visual features that are most correlated with the caption. In this paper, a Multi-level Attention Model based Recurrent Neural Network (MAM-RNN) is proposed, where the MAM encodes the visual features and the RNN works as the decoder to generate the video caption. During generation, the proposed approach adaptively attends both to the salient regions within each frame and to the frames correlated with the caption. Experimental results on two benchmark datasets, i.e., MSVD and Charades, demonstrate the excellent performance of the proposed approach.
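To illustrate the idea of attending first to regions within a frame and then to frames across the video, here is a minimal NumPy sketch of a generic two-level (region-then-frame) attention step conditioned on the decoder hidden state. The linear scorers `W_r` and `W_f` and all shapes are hypothetical simplifications for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along an axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(region_feats, h, W_r, W_f):
    """Generic two-level attention: regions within frames, then frames.

    region_feats: (T, R, D) -- R region features for each of T frames
    h:            (H,)      -- current decoder hidden state
    W_r:          (D + H,)  -- region-level scoring weights (hypothetical)
    W_f:          (D + H,)  -- frame-level scoring weights (hypothetical)
    Returns the attended context vector plus both weight distributions.
    """
    T, R, _ = region_feats.shape
    H = h.shape[0]
    # Region level: score each region jointly with the decoder state,
    # then pool regions into one feature per frame.
    h_r = np.broadcast_to(h, (T, R, H))
    region_scores = np.concatenate([region_feats, h_r], axis=-1) @ W_r  # (T, R)
    alpha = softmax(region_scores, axis=1)                              # region weights
    frame_feats = (alpha[..., None] * region_feats).sum(axis=1)         # (T, D)
    # Frame level: weight frames by relevance to the word being generated.
    h_f = np.broadcast_to(h, (T, H))
    frame_scores = np.concatenate([frame_feats, h_f], axis=-1) @ W_f    # (T,)
    beta = softmax(frame_scores, axis=0)                                # frame weights
    context = (beta[:, None] * frame_feats).sum(axis=0)                 # (D,)
    return context, alpha, beta
```

At each decoding step the RNN would consume `context` (together with the previous word) to predict the next word, so both attention levels are re-estimated per generated token.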

Cite

Text

Li et al. "MAM-RNN: Multi-Level Attention Model Based RNN for Video Captioning." International Joint Conference on Artificial Intelligence, 2017. doi:10.24963/IJCAI.2017/307

Markdown

[Li et al. "MAM-RNN: Multi-Level Attention Model Based RNN for Video Captioning." International Joint Conference on Artificial Intelligence, 2017.](https://mlanthology.org/ijcai/2017/li2017ijcai-mam/) doi:10.24963/IJCAI.2017/307

BibTeX

@inproceedings{li2017ijcai-mam,
  title     = {{MAM-RNN: Multi-Level Attention Model Based RNN for Video Captioning}},
  author    = {Li, Xuelong and Zhao, Bin and Lu, Xiaoqiang},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2017},
  pages     = {2208--2214},
  doi       = {10.24963/IJCAI.2017/307},
  url       = {https://mlanthology.org/ijcai/2017/li2017ijcai-mam/}
}