SparseTT: Visual Tracking with Sparse Transformers
Abstract
Transformers have been successfully applied to the visual tracking task and significantly promote tracking performance. The self-attention mechanism, designed to model long-range dependencies, is the key to the success of Transformers. However, self-attention lacks focus on the most relevant information in the search regions, making it easily distracted by the background. In this paper, we alleviate this issue with a sparse attention mechanism that focuses on the most relevant information in the search regions, which enables much more accurate tracking. Furthermore, we introduce a double-head predictor to boost the accuracy of foreground-background classification and regression of target bounding boxes, which further improves tracking performance. Extensive experiments show that, without bells and whistles, our method significantly outperforms state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123, while running at 40 FPS. Notably, the training time of our method is reduced by 75% compared to that of TransT. The source code and models are available at https://github.com/fzh0917/SparseTT.
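To illustrate the idea of sparse attention focusing only on the most relevant information, here is a minimal top-k sparse attention sketch in PyTorch. It is not the paper's implementation (see the repository above for that); the function name `topk_sparse_attention`, the value of `topk`, and the tensor shapes are illustrative assumptions. Each query keeps only its k highest-scoring keys and masks out the rest before the softmax, so background positions with low relevance receive zero attention weight.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=32):
    """Illustrative top-k sparse attention (not the paper's exact code).

    Each query attends only to its `topk` most relevant keys; all other
    attention scores are masked to -inf before the softmax.

    q, k, v: tensors of shape (batch, num_tokens, dim).
    """
    d = q.size(-1)
    # Scaled dot-product scores between queries and keys: (B, Nq, Nk).
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    # Keep only the top-k scores per query; mask everything below the
    # k-th largest score.
    topk = min(topk, scores.size(-1))
    topk_vals, _ = scores.topk(topk, dim=-1)
    threshold = topk_vals[..., -1, None]          # k-th largest score per query
    scores = scores.masked_fill(scores < threshold, float("-inf"))

    attn = F.softmax(scores, dim=-1)              # sparse attention weights
    return torch.matmul(attn, v)

# Example: a 400-token search-region feature map with 64-dim features
# (hypothetical sizes, for illustration only).
q = torch.randn(1, 400, 64)
k = torch.randn(1, 400, 64)
v = torch.randn(1, 400, 64)
out = topk_sparse_attention(q, k, v, topk=32)
print(out.shape)  # torch.Size([1, 400, 64])
```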
Cite
Text
Fu et al. "SparseTT: Visual Tracking with Sparse Transformers." International Joint Conference on Artificial Intelligence, 2022. doi:10.24963/IJCAI.2022/127
Markdown
[Fu et al. "SparseTT: Visual Tracking with Sparse Transformers." International Joint Conference on Artificial Intelligence, 2022.](https://mlanthology.org/ijcai/2022/fu2022ijcai-sparsett/) doi:10.24963/IJCAI.2022/127
BibTeX
@inproceedings{fu2022ijcai-sparsett,
title = {{SparseTT: Visual Tracking with Sparse Transformers}},
author = {Fu, Zhihong and Fu, Zehua and Liu, Qingjie and Cai, Wenrui and Wang, Yunhong},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2022},
pages = {905--912},
doi = {10.24963/IJCAI.2022/127},
url = {https://mlanthology.org/ijcai/2022/fu2022ijcai-sparsett/}
}