A Tale of Two Modalities for Video Captioning
Abstract
Recent advances in machine learning have led to significant accuracy improvements in generating textual captions for videos from audio and visual signals. In this work, we focus on the influence of modality (audio and visual input) on the semantic coherence and well-formedness of the generated captions. We explore both architectural and algorithmic choices that potentially influence how these modalities are utilized: algorithmic choices include pretraining, while architectural choices include modality-specific weighting schemes. We study the influence of these choices on a popular video captioning dataset, MSRVTT, through quantitative and extensive qualitative evaluations that measure the influence of audio-visual modalities, the cohesiveness of the captions, and the ranked relevance of keywords. We demonstrate qualitative improvements on metrics characterizing caption quality, while obtaining performance comparable to the state of the art on standard quantitative metrics such as BLEU-4 and METEOR.
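The abstract mentions a modality-specific weighting scheme without detailing it. As a rough illustration only (not the authors' architecture), the sketch below shows one common way such a scheme can be realized: a small learned gate that scores audio and visual features before fusing them for a captioning decoder. All module names, layer choices, and dimensions are illustrative assumptions.

```python
# Minimal sketch of modality-specific weighting: a learned gate assigns a
# per-sample scalar weight to each modality's features before fusion.
# This is an illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn


class ModalityGate(nn.Module):
    """Weighs audio and visual features with learned, per-sample scalars."""

    def __init__(self, audio_dim: int, visual_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # The gate produces one logit per modality from the concatenated projections.
        self.gate = nn.Linear(2 * hidden_dim, 2)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.audio_proj(audio_feat))    # (batch, hidden_dim)
        v = torch.tanh(self.visual_proj(visual_feat))  # (batch, hidden_dim)
        weights = torch.softmax(self.gate(torch.cat([a, v], dim=-1)), dim=-1)
        # Weighted combination of the two modality representations.
        return weights[:, 0:1] * a + weights[:, 1:2] * v


# Example: fuse 128-d audio and 2048-d visual clip features.
gate = ModalityGate(audio_dim=128, visual_dim=2048)
fused = gate(torch.randn(4, 128), torch.randn(4, 2048))  # shape (4, 512)
```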
Cite
Text
Joshi et al. "A Tale of Two Modalities for Video Captioning." IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00459
Markdown
[Joshi et al. "A Tale of Two Modalities for Video Captioning." IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/joshi2019iccvw-tale/) doi:10.1109/ICCVW.2019.00459
BibTeX
@inproceedings{joshi2019iccvw-tale,
title = {{A Tale of Two Modalities for Video Captioning}},
author = {Joshi, Pankaj and Saharia, Chitwan and Singh, Vishwajeet and Gautam, Digvijaysingh and Ramakrishnan, Ganesh and Jyothi, Preethi},
booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
year = {2019},
pages = {3708-3712},
doi = {10.1109/ICCVW.2019.00459},
url = {https://mlanthology.org/iccvw/2019/joshi2019iccvw-tale/}
}