Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention
Abstract
Automatically generating natural language descriptions for videos is an extremely complicated and challenging task. To overcome the limitations of traditional LSTM-based models for video captioning, we propose a novel architecture that generates optimal descriptions for videos. It focuses on constructing a new network structure that generates sentences superior to those of the basic LSTM-based model, and on establishing special attention mechanisms that provide more useful visual information for caption generation. This scheme discards the traditional LSTM and instead exploits a fully convolutional network with coarse-to-fine and inherited attention, designed according to the characteristics of the fully convolutional structure. Our model not only outperforms the basic LSTM-based model, but also achieves performance comparable to that of state-of-the-art methods.
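The core idea of replacing the LSTM decoder with a fully convolutional one can be illustrated with a minimal sketch: a causal (left-padded) 1D convolution plays the role of the recurrent decoder, and a dot-product attention step pools per-frame visual features into a context vector at each decoding step. This is an illustrative toy in NumPy, not the paper's implementation; all dimensions, weight initializations, and the single-layer structure are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w, b):
    """Causal 1D convolution: the output at step t sees only inputs <= t.
    x: (T, d_in), w: (k, d_in, d_out), b: (d_out,)."""
    k, T = w.shape[0], x.shape[0]
    pad = np.zeros((k - 1, x.shape[1]))
    xp = np.vstack([pad, x])                  # left-pad so no future tokens leak in
    out = np.stack([
        np.einsum('kd,kde->e', xp[t:t + k], w) + b
        for t in range(T)
    ])
    return np.maximum(out, 0.0)               # ReLU nonlinearity

def attend(q, feats):
    """Dot-product attention of each decoder state over frame features.
    q: (T, d), feats: (F, d) -> per-step visual context of shape (T, d)."""
    scores = q @ feats.T                      # (T, F) alignment scores
    scores -= scores.max(axis=1, keepdims=True)
    a = np.exp(scores)
    a /= a.sum(axis=1, keepdims=True)         # softmax over frames
    return a @ feats

# Toy dimensions (hypothetical): T caption steps, F frames, d channels,
# kernel size k, vocabulary size V.
T, F, d, k, V = 5, 8, 16, 3, 30
emb = rng.normal(size=(T, d))                 # embedded partial caption
feats = rng.normal(size=(F, d))               # per-frame CNN features
w = rng.normal(size=(k, d, d)) * 0.1
b = np.zeros(d)
W_out = rng.normal(size=(d, V)) * 0.1

h = causal_conv1d(emb, w, b)                  # convolutional decoder states
ctx = attend(h, feats)                        # attended visual context per step
logits = (h + ctx) @ W_out                    # next-word scores, shape (T, V)
print(logits.shape)
```

Because the convolution is causally padded, changing a later input word cannot alter earlier decoder states, which is what lets such a decoder be trained in parallel over all time steps yet decoded autoregressively.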
Cite
Text
Fang et al. "Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention." AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/AAAI.V33I01.33018271
Markdown
[Fang et al. "Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention." AAAI Conference on Artificial Intelligence, 2019.](https://mlanthology.org/aaai/2019/fang2019aaai-fully/) doi:10.1609/AAAI.V33I01.33018271
BibTeX
@inproceedings{fang2019aaai-fully,
title = {{Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention}},
author = {Fang, Kuncheng and Zhou, Lian and Jin, Cheng and Zhang, Yuejie and Weng, Kangnian and Zhang, Tao and Fan, Weiguo},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2019},
pages = {8271-8278},
doi = {10.1609/AAAI.V33I01.33018271},
url = {https://mlanthology.org/aaai/2019/fang2019aaai-fully/}
}