Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking

Abstract

Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, hindering the development of tracking in a more general system. This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction in the backbone helps to remove well-designed interaction modules and produce a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches with acceptable computational costs. Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles. The source codes are available at https://github.com/LPXTT/SimTrack.

Cite

Text

Chen et al. "Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20047-2_22

Markdown

[Chen et al. "Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/chen2022eccv-backbone/) doi:10.1007/978-3-031-20047-2_22

BibTeX

@inproceedings{chen2022eccv-backbone,
  title     = {{Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking}},
  author    = {Chen, Boyu and Li, Peixia and Bai, Lei and Qiao, Lei and Shen, Qiuhong and Li, Bo and Gan, Weihao and Wu, Wei and Ouyang, Wanli},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-20047-2_22},
  url       = {https://mlanthology.org/eccv/2022/chen2022eccv-backbone/}
}