SMSTracker: Tri-Path Score Mask Sigma Fusion for Multi-Modal Tracking
Abstract
Multi-modal object tracking has become an important research focus in computer vision because it remains robust under challenging conditions such as exposure variations, blur, and occlusion. Existing studies integrate supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms. This approach, however, has a critical limitation: it treats RGB as the dominant modality and therefore underutilizes the complementary information carried by the other modalities. To address this limitation, we present SMSTracker, a tri-path score mask sigma fusion framework for multi-modal tracking built on three key components. First, we design a tri-path Score Mask Fusion (SMF) module that evaluates and quantifies the reliability of each modality, so that complementary features across modalities can be exploited effectively. Second, we introduce a Sigma Interaction (SGI) module that fuses modal features across the three branches. Third, we propose a Drop Key Fine-tuning (DKF) strategy to address the unequal data contribution of the modalities in multi-modal learning, which strengthens the model's ability to process multi-modal information comprehensively. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets show that SMSTracker significantly outperforms existing state-of-the-art methods. Code and model are available at https://github.com/Leezed525/SMSTracker.
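The abstract names a Drop Key Fine-tuning (DKF) strategy but does not say how keys are dropped. The name suggests the DropKey regularizer proposed for Vision Transformers, which randomly masks attention logits before the softmax instead of applying dropout to the attention weights afterwards. Below is a minimal, stdlib-only Python sketch of that generic technique, assuming single-head attention on plain lists. It is not SMSTracker's implementation, and the names `dropkey_attention`, `masked_softmax`, and `drop_ratio` are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Vector = List[float]


def masked_softmax(logits: Vector, keep: List[bool]) -> Vector:
    """Softmax over the kept positions only; dropped positions get weight 0."""
    if not any(keep):
        # Every key was dropped: fall back to an unmasked softmax for this row
        # (a choice made for this sketch, not part of the original DropKey recipe).
        keep = [True] * len(logits)
    m = max(x for x, kp in zip(logits, keep) if kp)  # subtract max for stability
    exps = [math.exp(x - m) if kp else 0.0 for x, kp in zip(logits, keep)]
    total = sum(exps)
    return [e / total for e in exps]


def dropkey_attention(
    q: List[Vector],
    k: List[Vector],
    v: List[Vector],
    drop_ratio: float = 0.1,
    training: bool = True,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Vector], List[Vector]]:
    """Single-head scaled dot-product attention with DropKey-style masking.

    Hypothetical helper illustrating the generic DropKey idea, not SMSTracker code.
    During training each (query, key) logit is dropped independently with
    probability `drop_ratio` *before* the softmax, so the remaining weights
    renormalise on their own (no 1/(1-p) rescaling as in post-softmax dropout).
    At inference (`training=False`) no keys are dropped.
    """
    rng = rng or random.Random()
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, weights = [], []
    for qi in q:
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        if training and drop_ratio > 0.0:
            keep = [rng.random() >= drop_ratio for _ in k]
        else:
            keep = [True] * len(k)
        w = masked_softmax(logits, keep)
        # Output row = attention-weighted sum of the value vectors.
        out = [sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

Because dropped keys are removed before normalization, each row of attention weights still sums to 1, and with `training=False` the function reduces to ordinary scaled dot-product attention. Which modality's keys DKF drops, and at what ratio, would have to be taken from the paper or the released code.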
Cite
Text
Chan et al. "SMSTracker: Tri-Path Score Mask Sigma Fusion for Multi-Modal Tracking." International Conference on Computer Vision, 2025.
Markdown
[Chan et al. "SMSTracker: Tri-Path Score Mask Sigma Fusion for Multi-Modal Tracking." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/chan2025iccv-smstracker/)
BibTeX
@inproceedings{chan2025iccv-smstracker,
title = {{SMSTracker: Tri-Path Score Mask Sigma Fusion for Multi-Modal Tracking}},
author = {Chan, Sixian and Li, Zedong and Li, Wenhao and Lu, Shijian and Shen, Chunhua and Zhang, Xiaoqin},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {4766-4775},
url = {https://mlanthology.org/iccv/2025/chan2025iccv-smstracker/}
}