SBAT: Video Captioning with Sparse Boundary-Aware Transformer
Abstract
In this paper, we focus on effectively applying the transformer structure to video captioning. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. Video captioning, however, is a multimodal learning problem, and video features contain substantial redundancy across time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss introduced by the sparse operation. On top of SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods on most metrics.
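The abstract only sketches the mechanism, so the snippet below is a rough, hypothetical illustration of sparse boundary-aware selection: boundary scores are approximated here as feature differences between adjacent frames, and only the top-k most boundary-like time steps are kept. The function name and the difference-based scoring are assumptions for illustration; the paper's actual formulation pools scores from multi-head attention rather than raw frame features.

```python
# Hypothetical sketch of sparse boundary-aware selection (not the authors' code).
# Assumptions: frame features of shape (T, D); a per-step "boundary score" taken as
# the distance between adjacent frame features; the top-k scoring steps are kept.
import numpy as np

def sparse_boundary_select(features: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k time steps with the largest adjacent-frame change."""
    # L2 distance between consecutive frame features approximates scene boundaries.
    diffs = np.linalg.norm(features[1:] - features[:-1], axis=-1)
    # Prepend a zero so the score vector aligns with all T time steps.
    scores = np.concatenate([[0.0], diffs])
    # Keep the k most boundary-like steps, returned in temporal order.
    return np.sort(np.argsort(scores)[-k:])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((30, 512)).astype(np.float32)
    print(sparse_boundary_select(feats, k=8))  # indices of 8 selected frames
```

In this toy version, the sparse selection discards near-duplicate frames within a scene while retaining steps around scene changes, which is the intuition behind reducing redundancy in the video representation.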
Cite
Text
Jin et al. "SBAT: Video Captioning with Sparse Boundary-Aware Transformer." International Joint Conference on Artificial Intelligence, 2020. doi:10.24963/IJCAI.2020/88
Markdown
[Jin et al. "SBAT: Video Captioning with Sparse Boundary-Aware Transformer." International Joint Conference on Artificial Intelligence, 2020.](https://mlanthology.org/ijcai/2020/jin2020ijcai-sbat/) doi:10.24963/IJCAI.2020/88
BibTeX
@inproceedings{jin2020ijcai-sbat,
title = {{SBAT: Video Captioning with Sparse Boundary-Aware Transformer}},
author = {Jin, Tao and Huang, Siyu and Chen, Ming and Li, Yingming and Zhang, Zhongfei},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2020},
pages = {630--636},
doi = {10.24963/IJCAI.2020/88},
url = {https://mlanthology.org/ijcai/2020/jin2020ijcai-sbat/}
}