Deep Learning for Video Captioning: A Review
Abstract
Deep learning has recently achieved great success in solving specific artificial intelligence problems, and substantial progress has been made in Computer Vision (CV) and Natural Language Processing (NLP). As a connection between the worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task naturally decomposes into two sub-tasks. One is to encode the video, i.e., to thoroughly understand it and learn a visual representation. The other is caption generation, which decodes the learned representation into a sequential sentence, word by word. In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, followed by a summary of standard datasets and representative approaches. Finally, we highlight the challenges that are not yet fully understood in this task and present future research directions.
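The encoder-decoder decomposition described in the abstract can be made concrete with a minimal sketch. Assuming a PyTorch-style setup, the module names, dimensions, and the choice of GRUs below are illustrative placeholders, not the specific architectures surveyed in the paper:

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Minimal encoder-decoder captioner: a GRU encodes per-frame CNN
    features into a video representation; a second GRU decodes that
    representation into a caption one word at a time."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) per-frame features
        # captions:    (batch, seq_len) token ids, used for teacher forcing
        _, h = self.encoder(frame_feats)    # h: (1, batch, hidden_dim)
        emb = self.embed(captions)          # (batch, seq_len, hidden_dim)
        dec_out, _ = self.decoder(emb, h)   # decode conditioned on video state
        return self.out(dec_out)            # (batch, seq_len, vocab_size)

# Usage: 8 videos, 16 frames each, captions of length 12.
model = VideoCaptioner()
feats = torch.randn(8, 16, 2048)
caps = torch.randint(0, 10000, (8, 12))
logits = model(feats, caps)                 # (8, 12, 10000)
```

At each decoding step the output logits define a distribution over the vocabulary, so the next word is chosen (e.g., greedily or by beam search) and fed back in, matching the word-by-word generation the abstract describes.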
Cite
Text
Chen et al. "Deep Learning for Video Captioning: A Review." International Joint Conference on Artificial Intelligence, 2019. doi:10.24963/IJCAI.2019/877
Markdown
[Chen et al. "Deep Learning for Video Captioning: A Review." International Joint Conference on Artificial Intelligence, 2019.](https://mlanthology.org/ijcai/2019/chen2019ijcai-deep/) doi:10.24963/IJCAI.2019/877
BibTeX
@inproceedings{chen2019ijcai-deep,
title = {{Deep Learning for Video Captioning: A Review}},
author = {Chen, Shaoxiang and Yao, Ting and Jiang, Yu-Gang},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2019},
pages = {6283-6290},
doi = {10.24963/IJCAI.2019/877},
url = {https://mlanthology.org/ijcai/2019/chen2019ijcai-deep/}
}