Learning Tracking Representations via Dual-Branch Fully Transformer Networks

Abstract

We present a Siamese-like dual-branch network for tracking that is based solely on Transformers. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with the others within an attention window. For each token, we estimate whether it contains the target object and the corresponding target size. The advantage of this approach is that the features are learned from matching and, ultimately, for matching, so they are aligned with the object tracking task. The method achieves results better than or comparable to the best-performing methods, which first use a CNN to extract features and then use a Transformer to fuse them. It outperforms the state-of-the-art methods on the GOT-10k and VOT2020 benchmarks. In addition, the method achieves real-time inference speed (about 40 fps) on a single GPU. The code and models are released at https://github.com/phiphiphi31/DualTFR.
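
The pipeline described above (patch embedding of the two branches, feature learning through matching, and per-token target/size prediction) can be illustrated with a minimal PyTorch sketch. This is not the released DualTFR implementation: the module names are hypothetical, global attention stands in for the windowed attention used in the paper, and the matching step is reduced to a single cross-attention layer.

# Minimal sketch of a dual-branch Transformer tracker, under the assumptions
# stated above. Names (PatchEmbed, DualBranchTrackerSketch, etc.) are
# illustrative only and do not correspond to the released DualTFR code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, patch=16, in_ch=3, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim)


class DualBranchTrackerSketch(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        self.embed_z = PatchEmbed(dim=dim)     # template branch
        self.embed_x = PatchEmbed(dim=dim)     # search branch
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder_z = nn.TransformerEncoder(layer, depth)   # layers are cloned, not shared
        self.encoder_x = nn.TransformerEncoder(layer, depth)
        # Matching step: search tokens attend to (query) the template tokens.
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls_head = nn.Linear(dim, 1)      # per-token target-presence score
        self.size_head = nn.Linear(dim, 2)     # per-token (w, h) size estimate

    def forward(self, template, search):
        z = self.encoder_z(self.embed_z(template))   # (B, Nz, dim)
        x = self.encoder_x(self.embed_x(search))     # (B, Nx, dim)
        x, _ = self.cross(query=x, key=z, value=z)   # match search tokens to template
        return self.cls_head(x).squeeze(-1), self.size_head(x).sigmoid()


if __name__ == "__main__":
    model = DualBranchTrackerSketch()
    score, size = model(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 256, 256))
    print(score.shape, size.shape)             # torch.Size([1, 256]) torch.Size([1, 256, 2])

At inference time, the token with the highest presence score would give the target location and its size estimate the box dimensions; the actual method additionally relies on windowed (local) and global attention blocks rather than the plain encoders used here.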

Cite

Text

Xie et al. "Learning Tracking Representations via Dual-Branch Fully Transformer Networks." IEEE/CVF International Conference on Computer Vision Workshops, 2021. doi:10.1109/ICCVW54120.2021.00303

Markdown

[Xie et al. "Learning Tracking Representations via Dual-Branch Fully Transformer Networks." IEEE/CVF International Conference on Computer Vision Workshops, 2021.](https://mlanthology.org/iccvw/2021/xie2021iccvw-learning/) doi:10.1109/ICCVW54120.2021.00303

BibTeX

@inproceedings{xie2021iccvw-learning,
  title     = {{Learning Tracking Representations via Dual-Branch Fully Transformer Networks}},
  author    = {Xie, Fei and Wang, Chunyu and Wang, Guangting and Yang, Wankou and Zeng, Wenjun},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2021},
  pages     = {2688--2697},
  doi       = {10.1109/ICCVW54120.2021.00303},
  url       = {https://mlanthology.org/iccvw/2021/xie2021iccvw-learning/}
}