Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Abstract

We present a novel context-aware, attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes the visual features of an image as input and learns complex spatial patterns from two-dimensional context by selecting or ignoring its input. The Grid LSTM has not previously been applied to the image caption generation task. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. These region-grounded texts often describe the properties of the objects in an image and the relationships among them. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer uses a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word by word while considering the caption context around each word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state of the art.
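For readers unfamiliar with the soft attention baseline that the abstract contrasts against, the following is a minimal sketch of generic soft attention over image regions. It is not the authors' Grid LSTM mechanism. The names `soft_attention`, `regions`, and `query` are hypothetical, and dot-product scoring is an assumption chosen for brevity. The sketch illustrates one point: each region is scored independently, so reordering the regions leaves the attended context unchanged, which is the kind of spatial blindness the abstract says its dynamic spatial attention is meant to address.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, query):
    """Generic soft attention (illustrative, not the paper's method).

    region_features: list of feature vectors, one per image region.
    query: a vector standing in for the decoder state (assumption).
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each region independently with a dot product (assumed scoring).
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # The context vector is the attention-weighted average of region features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# A 2x2 grid of regions flattened to a list (toy values, not real features).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, context = soft_attention(regions, query)

# Reordering the regions changes their positions in the grid,
# but the context vector is the same up to floating-point rounding.
shuffled_weights, shuffled_context = soft_attention(list(reversed(regions)), query)
```

Because the weighted sum ignores where each region sits in the grid, any spatial relationship between regions has to be supplied by some other component, such as the Grid LSTM described in the abstract.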

Cite

Text

Khademi and Schulte. "Image Caption Generation with Hierarchical Contextual Visual Spatial Attention." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018. doi:10.1109/CVPRW.2018.00260

Markdown

[Khademi and Schulte. "Image Caption Generation with Hierarchical Contextual Visual Spatial Attention." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/khademi2018cvprw-image/) doi:10.1109/CVPRW.2018.00260

BibTeX

@inproceedings{khademi2018cvprw-image,
  title     = {{Image Caption Generation with Hierarchical Contextual Visual Spatial Attention}},
  author    = {Khademi, Mahmoud and Schulte, Oliver},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2018},
  pages     = {1943--1951},
  doi       = {10.1109/CVPRW.2018.00260},
  url       = {https://mlanthology.org/cvprw/2018/khademi2018cvprw-image/}
}