Image Caption with Global-Local Attention

Abstract

Image captioning is becoming an important task in the field of artificial intelligence. Most existing methods based on the CNN-RNN framework suffer from object omission and misprediction because they rely solely on a global, image-level representation. To address these problems, in this paper we propose a global-local attention (GLA) method that integrates object-level local representations with the image-level global representation through an attention mechanism. The proposed method can thus predict salient objects more precisely and with higher recall, while concurrently preserving image-level context. As a result, our GLA method generates more relevant sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset under several popular metrics.
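To make the global-local fusion concrete, below is a minimal PyTorch sketch of attending jointly over one global image feature and several object-level features at each decoding step. The dimensions, projection layers, and additive scoring function are illustrative assumptions for this sketch, not the authors' exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    """Attend over [global feature; N local object features] given the
    decoder hidden state. Layer choices here are assumptions."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)    # project features
        self.hidden_proj = nn.Linear(hidden_dim, hidden_dim)  # project decoder state
        self.score = nn.Linear(hidden_dim, 1)               # additive attention score

    def forward(self, global_feat, local_feats, decoder_hidden):
        # global_feat: (B, D); local_feats: (B, N, D); decoder_hidden: (B, H)
        feats = torch.cat([global_feat.unsqueeze(1), local_feats], dim=1)  # (B, N+1, D)
        energy = torch.tanh(self.feat_proj(feats)
                            + self.hidden_proj(decoder_hidden).unsqueeze(1))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)  # (B, N+1)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)        # (B, D)
        return context, weights

A hypothetical usage, with feature sizes chosen only for illustration:

gla = GlobalLocalAttention(feat_dim=2048, hidden_dim=512)
g = torch.randn(2, 2048)       # global CNN feature per image
l = torch.randn(2, 36, 2048)   # e.g. 36 detected object features
h = torch.randn(2, 512)        # current decoder hidden state
context, weights = gla(g, l, h)  # context feeds the RNN language model

The context vector would then condition the caption decoder at each step, letting it weigh salient objects against image-level context.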

Cite

Text

Li et al. "Image Caption with Global-Local Attention." AAAI Conference on Artificial Intelligence, 2017. doi:10.1609/AAAI.V31I1.11236

Markdown

[Li et al. "Image Caption with Global-Local Attention." AAAI Conference on Artificial Intelligence, 2017.](https://mlanthology.org/aaai/2017/li2017aaai-image/) doi:10.1609/AAAI.V31I1.11236

BibTeX

@inproceedings{li2017aaai-image,
  title     = {{Image Caption with Global-Local Attention}},
  author    = {Li, Linghui and Tang, Sheng and Deng, Lixi and Zhang, Yongdong and Tian, Qi},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2017},
  pages     = {4133--4139},
  doi       = {10.1609/AAAI.V31I1.11236},
  url       = {https://mlanthology.org/aaai/2017/li2017aaai-image/}
}