Spatio-Temporal Attention Network for Video Instance Segmentation

Abstract

In this paper, we propose a spatio-temporal attention network for video instance segmentation. The network estimates a global correlation map between successive frames and transforms it into an attention map. Augmented with this attention information, the features strengthen the response of instances belonging to the pre-defined categories, which in turn improves detection, segmentation, and tracking accuracy. Experimental results show that, combined with MaskTrack R-CNN, our method improves video instance segmentation accuracy from 0.293 to 0.400 on the YouTube-VIS test dataset with a single model. Our method took 6th place in the video instance segmentation track of the 2nd Large-scale Video Object Segmentation Challenge.
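
The core mechanism described in the abstract is a correlation-then-attention step between features of successive frames. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, assuming 1x1 projections and residual fusion; the module name and all variable names are our own illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of the spatio-temporal attention idea from the abstract.
# Assumptions: backbone features of shape (B, C, H, W) for frames t and t-1,
# 1x1 conv projections, and a residual fusion of the attended context.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalAttention(nn.Module):
    """Correlate features of two successive frames and use the resulting
    attention map to augment the current frame's features."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_cur: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_cur.shape
        q = self.query(feat_cur).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.key(feat_prev).flatten(2)                    # (B, C, HW)
        v = self.value(feat_prev).flatten(2).transpose(1, 2)  # (B, HW, C)

        # Global correlation map between all spatial positions of the two
        # frames, normalized into an attention map.
        corr = torch.bmm(q, k) / (c ** 0.5)                   # (B, HW, HW)
        attn = F.softmax(corr, dim=-1)

        # Aggregate previous-frame features and add them as a residual, so the
        # augmented features carry temporal context into detection,
        # segmentation, and tracking heads.
        context = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return feat_cur + context


if __name__ == "__main__":
    attn = SpatioTemporalAttention(channels=256)
    f_t = torch.randn(1, 256, 32, 32)
    f_prev = torch.randn(1, 256, 32, 32)
    print(attn(f_t, f_prev).shape)  # torch.Size([1, 256, 32, 32])
```

In such a design, the attended features could be fed into a MaskTrack R-CNN-style head in place of the raw per-frame features, which is consistent with how the abstract pairs the attention network with MaskTrack R-CNN.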

Cite

Text

Liu et al. "Spatio-Temporal Attention Network for Video Instance Segmentation." IEEE/CVF International Conference on Computer Vision Workshops, 2019. doi:10.1109/ICCVW.2019.00092

Markdown

[Liu et al. "Spatio-Temporal Attention Network for Video Instance Segmentation." IEEE/CVF International Conference on Computer Vision Workshops, 2019.](https://mlanthology.org/iccvw/2019/liu2019iccvw-spatiotemporal/) doi:10.1109/ICCVW.2019.00092

BibTeX

@inproceedings{liu2019iccvw-spatiotemporal,
  title     = {{Spatio-Temporal Attention Network for Video Instance Segmentation}},
  author    = {Liu, Xiaoyu and Ren, Haibing and Ye, Tingmeng},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2019},
  pages     = {725--727},
  doi       = {10.1109/ICCVW.2019.00092},
  url       = {https://mlanthology.org/iccvw/2019/liu2019iccvw-spatiotemporal/}
}