HFF-Tracker: A Hierarchical Fine-Grained Fusion Tracker for Referring Multi-Object Tracking
Abstract
Referring Multi-Object Tracking (RMOT) aims to track multiple objects based on a provided language expression. Although prior studies have sought to accomplish this by integrating a textual module into the multi-object tracker, these methods combine text and image features in a basic way, neglecting the importance of text features. In this study, we propose a Hierarchical Fine-grained text-image Fusion tracker, named HFF-Tracker, which can perform fine-grained fusion of pixel-level visual features and text features across various semantic levels. Specifically, we have devised a Hierarchical Multi-Modal Fusion (HMMF) module to merge text and image features at an early stage in a hierarchical and detailed manner. The Text-Guided Decoder (TGD) is designed to provide the query with prior semantic information during the decoding process. Additionally, we have crafted a Text-Guided Prediction Head (TGPH) that utilizes text information to enhance the performance of the prediction head. Furthermore, we have implemented an adaptive Look-Back training strategy to maximize the utilization of valuable labeled data. Extensive experiments on the Refer-KITTI dataset and the Refer-KITTI-V2 dataset demonstrate that our proposed HFF-Tracker outperforms other state-of-the-art methods by remarkable margins.
Cite
Text
Zhao et al. "HFF-Tracker: A Hierarchical Fine-Grained Fusion Tracker for Referring Multi-Object Tracking." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I10.33143
Markdown
[Zhao et al. "HFF-Tracker: A Hierarchical Fine-Grained Fusion Tracker for Referring Multi-Object Tracking." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zhao2025aaai-hff/) doi:10.1609/AAAI.V39I10.33143
BibTeX
@inproceedings{zhao2025aaai-hff,
title = {{HFF-Tracker: A Hierarchical Fine-Grained Fusion Tracker for Referring Multi-Object Tracking}},
author = {Zhao, Zeyong and Hao, Yanchao and Zhang, Minghao and Liu, Qingbin and Li, Bo and Sui, Dianbo and He, Shizhu and Chen, Xi},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {10528-10536},
doi = {10.1609/AAAI.V39I10.33143},
url = {https://mlanthology.org/aaai/2025/zhao2025aaai-hff/}
}