Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve a significant performance improvement over state-of-the-art methods that directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html.
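The abstract only outlines the architecture, so the following is a minimal sketch (in PyTorch, not the authors' implementation) of how a word probability P(w_t | w_1..t-1, I) could be computed by fusing a recurrent sentence representation with a CNN image feature in a multimodal layer followed by a softmax. All layer sizes, the additive fusion rule, and the use of a precomputed image feature vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MRNNSketch(nn.Module):
    """Hedged sketch of an m-RNN-style captioning model (sizes are assumptions)."""

    def __init__(self, vocab_size, embed_dim=256, rnn_dim=256,
                 image_dim=4096, multimodal_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word embedding
        self.rnn = nn.RNN(embed_dim, rnn_dim, batch_first=True)   # sentence sub-network
        # Project word, recurrent, and image features into a shared multimodal
        # space and add them; additive fusion is one plausible choice, assumed here.
        self.word_to_m = nn.Linear(embed_dim, multimodal_dim)
        self.rnn_to_m = nn.Linear(rnn_dim, multimodal_dim)
        self.img_to_m = nn.Linear(image_dim, multimodal_dim)
        self.out = nn.Linear(multimodal_dim, vocab_size)          # scores over next word

    def forward(self, word_ids, image_feat):
        # word_ids: (batch, seq_len) indices of previous words
        # image_feat: (batch, image_dim) feature from a pretrained convolutional network
        w = self.embed(word_ids)
        h, _ = self.rnn(w)
        m = torch.tanh(self.word_to_m(w) + self.rnn_to_m(h)
                       + self.img_to_m(image_feat).unsqueeze(1))
        # log P(next word | previous words, image) at every position
        return self.out(m).log_softmax(dim=-1)
```

Captions would then be generated by sampling (or beam search) from this distribution one word at a time, feeding each sampled word back in as input.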

Cite

Text

Mao et al. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)." International Conference on Learning Representations, 2015.

Markdown

[Mao et al. "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)." International Conference on Learning Representations, 2015.](https://mlanthology.org/iclr/2015/mao2015iclr-deep/)

BibTeX

@inproceedings{mao2015iclr-deep,
  title     = {{Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)}},
  author    = {Mao, Junhua and Xu, Wei and Yang, Yi and Wang, Jiang and Yuille, Alan L.},
  booktitle = {International Conference on Learning Representations},
  year      = {2015},
  url       = {https://mlanthology.org/iclr/2015/mao2015iclr-deep/}
}