Transform and Tell: Entity-Aware News Image Captioning

Tran, Alasdair; Mathews, Alexander; Xie, Lexing

doi:10.1109/CVPR42600.2020.01305

CVPR 2020

Abstract

We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.
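The abstract mentions generating captions "as a sequence of word parts" via byte-pair encoding. As a rough illustration of that idea only, the following is a minimal toy sketch of classic character-level BPE merge learning and segmentation. It is not the tokenizer used in the paper; the function names (`merge_pair`, `learn_bpe`, `segment`), the `</w>` end-of-word marker, and the toy word list are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into word parts by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    toy_words = ["lower", "lowest", "newer", "newest"]  # hypothetical toy corpus
    rules = learn_bpe(toy_words, num_merges=10)
    # An unseen word is still representable as a sequence of known parts.
    print(segment("slowest", rules))
```

Because segmentation falls back to single characters when no merge applies, an uncommon word or name that never appeared in the training text can still be written as a sequence of smaller parts, which is the property the abstract relies on for linguistically rich captions.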

Cite

Text

Tran et al. "Transform and Tell: Entity-Aware News Image Captioning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01305

Markdown

[Tran et al. "Transform and Tell: Entity-Aware News Image Captioning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/tran2020cvpr-transform/) doi:10.1109/CVPR42600.2020.01305

BibTeX

@inproceedings{tran2020cvpr-transform,
  title     = {{Transform and Tell: Entity-Aware News Image Captioning}},
  author    = {Tran, Alasdair and Mathews, Alexander and Xie, Lexing},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01305},
  url       = {https://mlanthology.org/cvpr/2020/tran2020cvpr-transform/}
}