A-CAP: Anticipation Captioning with Commonsense Knowledge
Abstract
Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparsely temporally-ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.
Cite
Text
Vo et al. "A-CAP: Anticipation Captioning with Commonsense Knowledge." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01042
Markdown
[Vo et al. "A-CAP: Anticipation Captioning with Commonsense Knowledge." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/vo2023cvpr-acap/) doi:10.1109/CVPR52729.2023.01042
BibTeX
@inproceedings{vo2023cvpr-acap,
title = {{A-CAP: Anticipation Captioning with Commonsense Knowledge}},
author = {Vo, Duc Minh and Luong, Quoc-An and Sugimoto, Akihiro and Nakayama, Hideki},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {10824--10833},
doi = {10.1109/CVPR52729.2023.01042},
url = {https://mlanthology.org/cvpr/2023/vo2023cvpr-acap/}
}