S2 Transformer for Image Captioning

Abstract

Transformer-based architectures with grid features represent the state of the art in visual and language reasoning tasks, such as visual question answering and image-text matching. However, directly applying them to image captioning may cause a loss of spatial and fine-grained semantic information, and their applicability to this task remains largely under-explored. To this end, we propose a simple yet effective method, the Spatial- and Scale-aware Transformer (S2 Transformer), for image captioning. Specifically, we first propose a Spatial-aware Pseudo-supervised (SP) module, which resorts to feature clustering to help preserve spatial information in grid features. Next, to keep the model size unchanged while producing superior results, we build a simple weighted residual connection, named the Scale-wise Reinforcement (SR) module, to simultaneously exploit both low- and high-level encoded features with rich semantics. Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla Transformer. The source code is available at https://github.com/zchoi/S2-Transformer.
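
Since the abstract only names the two modules, the following is a minimal PyTorch sketch of how they could look. The class names, tensor shapes, soft-clustering scheme, and softmax layer weighting are all assumptions made for illustration, not the authors' implementation; see the GitHub repository above for the official code.

```python
import torch
import torch.nn as nn


class SpatialAwareClustering(nn.Module):
    """Hypothetical SP-style module: soft-assigns grid features to a small
    set of learnable centroids, so grid cells falling in the same cluster
    share cluster-level (pseudo-region) semantics."""

    def __init__(self, d_model: int = 512, n_clusters: int = 5):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, d_model))

    def forward(self, grids: torch.Tensor):
        # grids: (B, N, d) flattened grid features, e.g. N = 7 * 7
        logits = grids @ self.centroids.t()          # (B, N, K) similarities
        assign = logits.softmax(dim=-1)              # soft cluster assignment
        clustered = assign @ self.centroids          # (B, N, d) cluster mixture
        # Residually inject cluster-level context; the assignment map could
        # serve as the pseudo-supervision signal during training.
        return grids + clustered, assign


class ScaleWiseReinforcement(nn.Module):
    """Hypothetical SR-style module: a weighted residual connection over the
    outputs of all encoder layers, fusing low- and high-level semantics."""

    def __init__(self, n_layers: int = 3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (B, N, d) tensors, one per encoder layer
        w = self.weights.softmax(dim=0)              # normalized layer weights
        stacked = torch.stack(layer_outputs, dim=0)  # (L, B, N, d)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)


if __name__ == "__main__":
    grids = torch.randn(2, 49, 512)                  # batch of 7x7 grid features
    x, assign = SpatialAwareClustering()(grids)
    layer_outs = [x, x.relu(), x.tanh()]             # stand-ins for encoder layers
    fused = ScaleWiseReinforcement(n_layers=3)(layer_outs)
    print(fused.shape)                               # torch.Size([2, 49, 512])
```

Under these assumptions, the fused output keeps the grid resolution and model width fixed, which is consistent with the abstract's claim that the SR module adds little beyond a handful of scalar layer weights.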

Cite

Text

Zeng et al. "S2 Transformer for Image Captioning." International Joint Conference on Artificial Intelligence, 2022. doi:10.24963/ijcai.2022/224

Markdown

[Zeng et al. "S2 Transformer for Image Captioning." International Joint Conference on Artificial Intelligence, 2022.](https://mlanthology.org/ijcai/2022/zeng2022ijcai-s/) doi:10.24963/ijcai.2022/224

BibTeX

@inproceedings{zeng2022ijcai-s,
  title     = {{S2 Transformer for Image Captioning}},
  author    = {Zeng, Pengpeng and Zhang, Haonan and Song, Jingkuan and Gao, Lianli},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {1608--1614},
  doi       = {10.24963/ijcai.2022/224},
  url       = {https://mlanthology.org/ijcai/2022/zeng2022ijcai-s/}
}