Show, Recall, and Tell: Image Captioning with Recall Mechanism
Abstract
Generating natural and accurate descriptions in image captioning has always been a challenge. In this paper, we propose a novel recall mechanism to imitate the way humans conduct captioning. Our recall mechanism has three parts: a recall unit, a semantic guide (SG), and a recalled-word slot (RWS). The recall unit is a text-retrieval module designed to retrieve recalled words for images. SG and RWS are designed to make the best use of the recalled words. The SG branch generates a recalled context, which guides the caption generation process. The RWS branch is responsible for copying recalled words into the caption. Inspired by the pointing mechanism in text summarization, we adopt a soft switch to balance the generated-word probabilities between SG and RWS. In the CIDEr optimization step, we also introduce an individual recalled-word reward (WR) to boost training. Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on the MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods.
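The soft switch mentioned in the abstract can be illustrated with a small sketch in the style of pointer-generator models from text summarization. This is not the authors' implementation: the function name `mix_with_soft_switch`, the convention that `switch` weights the generation branch, and the use of a softmax over recalled-word scores are all assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_with_soft_switch(vocab_logits, recalled_word_ids, recalled_scores, switch):
    """Blend a generation distribution with a copy distribution.

    Hypothetical convention (not taken from the paper):
      switch = 1.0 -> use only the generation (SG-like) branch
      switch = 0.0 -> use only the copy (RWS-like) branch
    """
    if not 0.0 <= switch <= 1.0:
        raise ValueError("switch must lie in [0, 1]")

    # Distribution over the full vocabulary from the generation branch.
    p_gen = softmax(vocab_logits)

    # With nothing to copy, fall back to pure generation.
    if not recalled_word_ids:
        return p_gen

    # Attention-like weights over the recalled words.
    p_attn = softmax(recalled_scores)

    # Scatter the copy weights onto vocabulary positions; a word that was
    # recalled more than once accumulates its weights.
    p_copy = [0.0] * len(vocab_logits)
    for word_id, weight in zip(recalled_word_ids, p_attn):
        p_copy[word_id] += weight

    # Convex combination, so the result is still a probability distribution.
    return [switch * g + (1.0 - switch) * c for g, c in zip(p_gen, p_copy)]


if __name__ == "__main__":
    probs = mix_with_soft_switch(
        vocab_logits=[0.0, 0.0, 0.0, 0.0],
        recalled_word_ids=[1, 3],
        recalled_scores=[0.0, 0.0],
        switch=0.5,
    )
    print(probs)
```

In a trained model the switch would typically be predicted at each decoding step from the decoder state rather than fixed by hand; a constant is used here only to keep the sketch short.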
Cite
Text
Wang et al. "Show, Recall, and Tell: Image Captioning with Recall Mechanism." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I07.6898
Markdown
[Wang et al. "Show, Recall, and Tell: Image Captioning with Recall Mechanism." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/wang2020aaai-show/) doi:10.1609/AAAI.V34I07.6898
BibTeX
@inproceedings{wang2020aaai-show,
title = {{Show, Recall, and Tell: Image Captioning with Recall Mechanism}},
author = {Wang, Li and Bai, Zechen and Zhang, Yonghua and Lu, Hongtao},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2020},
pages = {12176-12183},
doi = {10.1609/AAAI.V34I07.6898},
url = {https://mlanthology.org/aaai/2020/wang2020aaai-show/}
}