High-Performance Discriminative Tracking with Transformers
Abstract
End-to-end discriminative trackers have significantly improved the state of the art, yet their robustness and efficiency remain limited by the conventional discriminative model, i.e., least-squares regression. In this paper, we present DTT, a novel single-object discriminative tracker based on an encoder-decoder Transformer architecture. Through self- and encoder-decoder attention mechanisms, our approach exploits rich scene information in an end-to-end manner, effectively removing the need for hand-designed discriminative models. In online tracking, given a new test frame, dense prediction is performed at all spatial positions. Not only the location but also the bounding box of the target object is obtained in a robust fashion, streamlining the discriminative tracking pipeline. DTT is conceptually simple and easy to implement. It achieves state-of-the-art performance on four popular benchmarks, including GOT-10k, LaSOT, NfS, and TrackingNet, while running at over 50 FPS, confirming both its effectiveness and its efficiency. We hope DTT can provide a new perspective on single-object visual tracking.
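To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of an encoder-decoder Transformer head that performs dense prediction over all spatial positions of a test-frame feature map. It is an illustration under stated assumptions, not the authors' implementation: the module name `DenseTransformerTrackerHead`, the use of `nn.Transformer`, the feature dimensions, and the (l, t, r, b) box parameterization are all hypothetical choices for the sketch.

```python
import torch
import torch.nn as nn

class DenseTransformerTrackerHead(nn.Module):
    """Hypothetical encoder-decoder Transformer head for discriminative tracking.

    The encoder ingests flattened template (target) features; the decoder
    attends from every spatial position of the search-region features, so a
    confidence score and a bounding box are predicted densely at all positions.
    """

    def __init__(self, dim=256, heads=8, enc_layers=3, dec_layers=3):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
            batch_first=True,
        )
        self.score_head = nn.Linear(dim, 1)  # per-position target confidence
        self.box_head = nn.Linear(dim, 4)    # per-position box, e.g. (l, t, r, b) offsets

    def forward(self, template_feat, search_feat):
        # template_feat: (B, C, Ht, Wt); search_feat: (B, C, Hs, Ws)
        B, C, Hs, Ws = search_feat.shape
        src = template_feat.flatten(2).transpose(1, 2)  # (B, Ht*Wt, C) encoder input
        tgt = search_feat.flatten(2).transpose(1, 2)    # (B, Hs*Ws, C) decoder queries
        feats = self.transformer(src, tgt)              # (B, Hs*Ws, C)
        scores = self.score_head(feats).view(B, Hs, Ws)            # dense score map
        boxes = self.box_head(feats).view(B, Hs, Ws, 4).sigmoid()  # dense boxes
        return scores, boxes


if __name__ == "__main__":
    head = DenseTransformerTrackerHead()
    z = torch.randn(1, 256, 8, 8)    # template features from the annotated frame
    x = torch.randn(1, 256, 16, 16)  # search-region features from the test frame
    scores, boxes = head(z, x)
    print(scores.shape, boxes.shape)  # (1, 16, 16) and (1, 16, 16, 4)
```

In this reading of the pipeline, the argmax of the dense score map gives the target location, and the box predicted at that same position completes the state estimate, so no separate hand-designed discriminative model or box-refinement stage is required.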
Cite
Text
Yu et al. "High-Performance Discriminative Tracking with Transformers." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00971
Markdown
[Yu et al. "High-Performance Discriminative Tracking with Transformers." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yu2021iccv-highperformance/) doi:10.1109/ICCV48922.2021.00971
BibTeX
@inproceedings{yu2021iccv-highperformance,
title = {{High-Performance Discriminative Tracking with Transformers}},
author = {Yu, Bin and Tang, Ming and Zheng, Linyu and Zhu, Guibo and Wang, Jinqiao and Feng, Hao and Feng, Xuetao and Lu, Hanqing},
booktitle = {International Conference on Computer Vision},
year = {2021},
pages = {9856--9865},
doi = {10.1109/ICCV48922.2021.00971},
url = {https://mlanthology.org/iccv/2021/yu2021iccv-highperformance/}
}