Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning

Abstract

Image captioning aims to generate textual descriptions for images. Most previous work generates a single-sentence description for each image. However, a picture is worth a thousand words: even for humans, a single sentence can hardly give a complete view of an image. In this paper, we propose a novel Topic-Oriented Multi-Sentence (\emph{TOMS}) captioning model, which can generate multiple topic-oriented sentences to describe an image. Different from object instances or attributes, topics mined by latent Dirichlet allocation (LDA) reflect hidden thematic structures in the reference sentences of an image. In our model, each topic is integrated into a caption generator through a Fusion Gate Unit (FGU), which guides the generation of a sentence towards a certain topic perspective. With multiple sentences from different topics, our \emph{TOMS} model provides a complete description of an image. Experimental results on both sentence and paragraph datasets demonstrate the effectiveness of \emph{TOMS} in terms of topical consistency and descriptive completeness.
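The abstract describes fusing each LDA topic into the caption generator via a Fusion Gate Unit, but does not give the exact formulation. A common reading of such gated fusion is a learned sigmoid gate that blends the decoder hidden state with a projected topic vector at each decoding step. Below is a minimal PyTorch sketch under that assumption; the class name FusionGateUnit, the layer names, and the dimensions are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class FusionGateUnit(nn.Module):
    """Hypothetical gated fusion of a topic vector with a decoder
    hidden state (an illustrative sketch, not the paper's code)."""

    def __init__(self, hidden_dim: int, topic_dim: int):
        super().__init__()
        self.topic_proj = nn.Linear(topic_dim, hidden_dim)  # map topic into hidden space
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)   # per-dimension fusion gate

    def forward(self, h: torch.Tensor, topic: torch.Tensor) -> torch.Tensor:
        t = torch.tanh(self.topic_proj(topic))               # projected topic vector
        g = torch.sigmoid(self.gate(torch.cat([h, t], -1)))  # gate values in (0, 1)
        return g * h + (1.0 - g) * t                         # convex blend of state and topic

# Usage: condition a batch of decoder states on LDA topic vectors.
fgu = FusionGateUnit(hidden_dim=512, topic_dim=100)
h = torch.randn(8, 512)      # decoder hidden states
topic = torch.randn(8, 100)  # topic vectors (e.g., LDA topic proportions)
fused = fgu(h, topic)        # topic-conditioned states, shape (8, 512)

Feeding a different topic vector through the same gate yields a different conditioned state, which is one way multiple topic-oriented sentences could be decoded from a single image.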

Cite

Text

Mao et al. "Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/592

Markdown

[Mao et al. "Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/mao2018ijcai-show/) doi:10.24963/IJCAI.2018/592

BibTeX

@inproceedings{mao2018ijcai-show,
  title     = {{Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning}},
  author    = {Mao, Yuzhao and Zhou, Chang and Wang, Xiaojie and Li, Ruifan},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {4258--4264},
  doi       = {10.24963/IJCAI.2018/592},
  url       = {https://mlanthology.org/ijcai/2018/mao2018ijcai-show/}
}