Exploring the Feature Extraction and Relation Modeling for Light-Weight Transformer Tracking

Abstract

Recent advancements in transformer-based lightweight object tracking have set new standards across various benchmarks due to their efficiency and effectiveness. Despite these achievements, most current trackers rely heavily on pre-existing object detection architectures without optimizing the backbone network to leverage the unique demands of object tracking. Addressing this gap, we introduce the Feature Extraction and Relation Modeling Tracker (FERMT), a novel approach that significantly enhances tracking speed and accuracy. At the heart of FERMT is a strategic decomposition of the conventional attention mechanism into four distinct sub-modules within a one-stream tracker. This design stems from our insight that the initial layers of a tracking network should prioritize feature extraction, whereas the deeper layers should focus on relation modeling between objects. Consequently, we propose an innovative, lightweight backbone specifically tailored for object tracking. Our approach is validated through meticulous ablation studies, confirming the effectiveness of our architectural decisions. Furthermore, FERMT incorporates a Dual Attention Unit for feature pre-processing, which facilitates global feature interaction across channels and enriches feature representation with attention cues. Benchmarking on GOT-10k, FERMT achieves a groundbreaking Average Overlap (AO) score of 69.6%, outperforming the leading real-time trackers by 5.6% in accuracy while boasting a 54% improvement in CPU tracking speed. This work not only sets a new standard for state-of-the-art (SOTA) performance in light-weight tracking but also bridges the efficiency gap between fast and high-performance trackers. The code and models are available at https://github.com/KarlesZheng/FERMT.
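The abstract does not name the four attention sub-modules. One common way to read such a decomposition in a one-stream tracker, where template and search tokens are concatenated and attended jointly, is that the joint score matrix splits into four blocks by query and key source. The sketch below illustrates only that reading and is not FERMT's implementation; the function names, token counts, and random weights are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def attention_scores(queries, keys):
    """Scaled dot-product scores Q K^T / sqrt(d), before softmax."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    return [[v / scale for v in row] for row in matmul(queries, keys_t)]


def split_joint_scores(scores, n_template):
    """Slice joint scores into four blocks, named <query source>_to_<key source>.

    ASSUMPTION: this naming and split are illustrative, not taken from the paper.
    """
    top, bottom = scores[:n_template], scores[n_template:]
    return {
        "template_to_template": [r[:n_template] for r in top],
        "template_to_search": [r[n_template:] for r in top],
        "search_to_template": [r[:n_template] for r in bottom],
        "search_to_search": [r[n_template:] for r in bottom],
    }


random.seed(0)
dim, n_template, n_search = 4, 2, 3  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


template = rand_matrix(n_template, dim)  # template tokens
search = rand_matrix(n_search, dim)      # search-region tokens
w_q, w_k = rand_matrix(dim, dim), rand_matrix(dim, dim)

# One-stream attention: concatenate the tokens, then project and score jointly.
tokens = template + search
joint = attention_scores(matmul(tokens, w_q), matmul(tokens, w_k))
blocks = split_joint_scores(joint, n_template)

# The same block computed on its own: search queries against template keys.
direct = attention_scores(matmul(search, w_q), matmul(template, w_k))
```

Under this reading, the same-source blocks (template to template, search to search) refine features within one image, while the cross-source blocks relate the template to the search region. That would match the abstract's distinction between feature extraction and relation modeling, though the paper itself should be consulted for the actual sub-module definitions.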

Cite

Text

Zheng et al. "Exploring the Feature Extraction and Relation Modeling for Light-Weight Transformer Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73397-0_7

Markdown

[Zheng et al. "Exploring the Feature Extraction and Relation Modeling for Light-Weight Transformer Tracking." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zheng2024eccv-exploring/) doi:10.1007/978-3-031-73397-0_7

BibTeX

@inproceedings{zheng2024eccv-exploring,
  title     = {{Exploring the Feature Extraction and Relation Modeling for Light-Weight Transformer Tracking}},
  author    = {Zheng, Jikai and Liang, Mingjiang and Huang, Shaoli and Ning, Jifeng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73397-0_7},
  url       = {https://mlanthology.org/eccv/2024/zheng2024eccv-exploring/}
}