Semi-Autoregressive Transformer for Image Captioning

Abstract

Current state-of-the-art image captioning models adopt autoregressive decoders, i.e., they generate each word conditioned on previously generated words, which leads to high latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from a large degradation in generation quality because they excessively remove word dependencies. To strike a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning (dubbed SATIC), which keeps the autoregressive property globally but generates words in parallel locally. Built on the Transformer, SATIC requires only a few modifications. Experimental results on the MSCOCO image captioning benchmark show that SATIC achieves a good trade-off without bells and whistles. Code is available at https://github.com/YuanEZhou/satic.
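
To make the "globally autoregressive, locally parallel" idea concrete, here is a minimal decoding sketch in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: we assume a `decoder(prefix, memory)` callable that, given the generated prefix and the encoded image features, returns logits of shape `(batch, K, vocab)` for the next group of K words. Names such as `semi_autoregressive_decode`, `bos_id`, and `eos_id` are hypothetical.

```python
import torch


@torch.no_grad()
def semi_autoregressive_decode(decoder, memory, bos_id, eos_id, K=2, max_len=20):
    """Sketch of SATIC-style decoding: autoregressive across groups,
    parallel within each group of K words.

    Assumes (illustratively) that `decoder(prefix, memory)` returns
    logits of shape (B, K, vocab) for the next K tokens given the
    generated prefix and the image features `memory` of shape (B, N, D).
    """
    B = memory.size(0)
    prefix = torch.full((B, 1), bos_id, dtype=torch.long, device=memory.device)
    for _ in range(max_len // K):
        logits = decoder(prefix, memory)            # one step emits K words at once
        group = logits.argmax(dim=-1)               # greedy choice inside the group
        prefix = torch.cat([prefix, group], dim=1)  # groups chain autoregressively
        if (group == eos_id).any(dim=1).all():      # stop once every caption ended
            break
    return prefix[:, 1:]                            # drop the BOS token
```

Note how the group size K interpolates between the two extremes discussed in the abstract: K = 1 recovers a fully autoregressive decoder, while K = max_len would generate the whole caption in one parallel step, as non-autoregressive models do.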

Cite

Text

Zhou et al. "Semi-Autoregressive Transformer for Image Captioning." IEEE/CVF International Conference on Computer Vision Workshops, 2021. doi:10.1109/ICCVW54120.2021.00350

Markdown

[Zhou et al. "Semi-Autoregressive Transformer for Image Captioning." IEEE/CVF International Conference on Computer Vision Workshops, 2021.](https://mlanthology.org/iccvw/2021/zhou2021iccvw-semiautoregressive/) doi:10.1109/ICCVW54120.2021.00350

BibTeX

@inproceedings{zhou2021iccvw-semiautoregressive,
  title     = {{Semi-Autoregressive Transformer for Image Captioning}},
  author    = {Zhou, Yuanen and Zhang, Yong and Hu, Zhenzhen and Wang, Meng},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2021},
  pages     = {3132--3136},
  doi       = {10.1109/ICCVW54120.2021.00350},
  url       = {https://mlanthology.org/iccvw/2021/zhou2021iccvw-semiautoregressive/}
}