Learning Spatio-Temporal Transformer for Visual Tracking

Abstract

In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, object prediction uses only a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on multiple challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
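To make the described pipeline concrete, the sketch below shows one way the pieces fit together: an encoder fuses template and search-region features, a single learned query is decoded against the fused features, and a fully-convolutional head predicts corner probability maps over the search region. This is only a minimal illustration, not the released STARK code (see the repository linked above); the class name `SimpleTransformerTracker`, the layer sizes, and the query-similarity weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SimpleTransformerTracker(nn.Module):
    """Minimal sketch of an encoder-decoder tracking head.

    NOT the released STARK implementation; dimensions, layer counts,
    and the corner head below are assumptions for illustration only.
    """

    def __init__(self, dim=256, heads=8, enc_layers=6, dec_layers=6):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=dim,
            nhead=heads,
            num_encoder_layers=enc_layers,
            num_decoder_layers=dec_layers,
            batch_first=True,
        )
        # One learned target query for the decoder.
        self.query = nn.Embedding(1, dim)
        # Fully-convolutional corner head: two maps over the search region,
        # one for the top-left corner and one for the bottom-right corner.
        self.corner_head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim // 2, 2, kernel_size=1),
        )

    def forward(self, template_feats, search_feats, search_hw):
        # template_feats: (B, Nt, dim), search_feats: (B, H*W, dim) --
        # flattened backbone features of the template(s) and search region.
        h, w = search_hw
        src = torch.cat([template_feats, search_feats], dim=1)

        # Encoder fuses template and search-region features globally.
        memory = self.transformer.encoder(src)

        # Decoder learns a single query embedding from the fused features.
        tgt = self.query.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        query_emb = self.transformer.decoder(tgt, memory)       # (B, 1, dim)

        # Weight the search-region features by their similarity to the query
        # (a simplified stand-in for how the query embedding could be used),
        # then predict corner probability maps with the convolutional head.
        search_mem = memory[:, -h * w:, :]                      # (B, h*w, dim)
        sim = (search_mem * query_emb).sum(-1, keepdim=True).sigmoid()
        feat = (search_mem * sim).transpose(1, 2).reshape(-1, search_mem.size(-1), h, w)
        corner_logits = self.corner_head(feat)                  # (B, 2, h, w)
        corner_probs = corner_logits.flatten(2).softmax(-1)     # (B, 2, h*w)
        return corner_probs.view(-1, 2, h, w)


# Toy usage with random features (8x8 template, 14x14 search region).
if __name__ == "__main__":
    model = SimpleTransformerTracker()
    z = torch.randn(2, 64, 256)     # template features
    x = torch.randn(2, 196, 256)    # search-region features
    print(model(z, x, (14, 14)).shape)  # torch.Size([2, 2, 14, 14])
```

Because the head outputs per-pixel corner probability maps rather than anchor offsets or proposals, the box can be read off directly from the two maps, which is what lets the pipeline drop anchors, proposals, and post-processing such as cosine windowing, as the abstract emphasizes.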

Cite

Text

Yan et al. "Learning Spatio-Temporal Transformer for Visual Tracking." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.01028

Markdown

[Yan et al. "Learning Spatio-Temporal Transformer for Visual Tracking." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yan2021iccv-learning/) doi:10.1109/ICCV48922.2021.01028

BibTeX

@inproceedings{yan2021iccv-learning,
  title     = {{Learning Spatio-Temporal Transformer for Visual Tracking}},
  author    = {Yan, Bin and Peng, Houwen and Fu, Jianlong and Wang, Dong and Lu, Huchuan},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {10448-10457},
  doi       = {10.1109/ICCV48922.2021.01028},
  url       = {https://mlanthology.org/iccv/2021/yan2021iccv-learning/}
}