Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking
Abstract
Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, hindering the development of tracking in a more general system. This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction in the backbone helps to remove well-designed interaction modules and produce a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches with acceptable computational costs. Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles. The source code is available at https://github.com/LPXTT/SimTrack.
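The serialization step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`patchify`, `center_crop`, `serialize_pair`), the image sizes, the patch size, and the way the foveal window is modeled (a plain central crop of the template, tokenized with the same patch size) are all assumptions chosen for illustration. It only shows how a template and a search image might be turned into one token sequence for a single-branch backbone.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (N, patch * patch * C), one row per patch,
    ordered row by row across the image.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Return the central size x size region of an (H, W, C) image."""
    h, w, _ = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]


def serialize_pair(template, search, patch=16, foveal_size=None):
    """Concatenate template, optional foveal, and search tokens into one sequence.

    Assumption: the foveal window is modeled as a central crop of the
    template; the actual strategy in the paper may differ in its details.
    """
    tokens = [patchify(template, patch)]
    if foveal_size is not None:
        tokens.append(patchify(center_crop(template, foveal_size), patch))
    tokens.append(patchify(search, patch))
    return np.concatenate(tokens, axis=0)


# Illustrative sizes only (hypothetical, not taken from the abstract).
rng = np.random.default_rng(0)
template = rng.random((112, 112, 3))
search = rng.random((224, 224, 3))

sequence = serialize_pair(template, search, patch=16, foveal_size=64)
print(sequence.shape)  # 49 template + 16 foveal + 196 search tokens, each of length 768
```

In a full model, this joint sequence would pass through a linear projection and the transformer backbone, where self-attention over all tokens would perform feature extraction and template-search interaction in the same layers.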
Cite
Text
Chen et al. "Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20047-2_22
Markdown
[Chen et al. "Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/chen2022eccv-backbone/) doi:10.1007/978-3-031-20047-2_22
BibTeX
@inproceedings{chen2022eccv-backbone,
title = {{Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking}},
author = {Chen, Boyu and Li, Peixia and Bai, Lei and Qiao, Lei and Shen, Qiuhong and Li, Bo and Gan, Weihao and Wu, Wei and Ouyang, Wanli},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20047-2_22},
url = {https://mlanthology.org/eccv/2022/chen2022eccv-backbone/}
}