Foundation Model Driven Appearance Extraction for Robust Multiple Object Tracking

Abstract

Multiple Object Tracking (MOT) is a fundamental task in computer vision. Existing methods rely on motion information or appearance information to track objects, but they still struggle under difficult conditions such as occlusion and blur in complex scenes. Inspired by the fact that people can pinpoint objects from verbal descriptions, we explore long-term robust tracking based on the semantic features of objects. Motivated by the success of multimodal foundation models in text-image alignment, we revisit the appearance feature extraction module in MOT and propose a Foundation model Driven multi-object tracker (FDTracker). Specifically, we propose an appearance feature extractor trained in two stages. In the first stage, given a single image of an object as input, the model learns to capture the object's attributes with the assistance of natural language instructions. In the second stage, given a sequence of object images as input, the model learns to use these attributes to distinguish between different objects and to associate the same object across time. Finally, to coordinate appearance and motion information, we propose a strategy that combines the two, which better facilitates trajectory assignment and reconnection. Extensive experiments on benchmarks demonstrate the robustness of FDTracker.
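
The abstract does not spell out how appearance and motion cues are combined, so the following is only a generic illustration of the idea, not FDTracker's actual strategy. It assumes a weighted sum of an appearance cost (cosine distance between feature vectors) and a motion cost (one minus box IoU), followed by greedy matching. The names `fused_cost` and `greedy_match`, the parameters `weight` and `max_cost`, and the dictionary layout of tracks and detections are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Appearance cost: 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fused_cost(track, det, weight=0.5):
    """Assumed fusion rule: weighted sum of appearance and motion costs."""
    appearance = cosine_distance(track["feat"], det["feat"])
    motion = 1.0 - iou(track["box"], det["box"])
    return weight * appearance + (1.0 - weight) * motion


def greedy_match(tracks, dets, weight=0.5, max_cost=0.7):
    """Pair tracks with detections, cheapest fused cost first.

    Pairs costing more than max_cost are left unmatched.
    Returns a list of (track_index, detection_index) tuples.
    """
    candidates = sorted(
        (fused_cost(t, d, weight), i, j)
        for i, t in enumerate(tracks)
        for j, d in enumerate(dets)
    )
    used_tracks, used_dets, matches = set(), set(), []
    for cost, i, j in candidates:
        if cost > max_cost:
            break  # remaining candidates are even more expensive
        if i in used_tracks or j in used_dets:
            continue
        matches.append((i, j))
        used_tracks.add(i)
        used_dets.add(j)
    return matches


if __name__ == "__main__":
    tracks = [
        {"box": (0, 0, 10, 10), "feat": [1.0, 0.0]},
        {"box": (20, 20, 30, 30), "feat": [0.0, 1.0]},
    ]
    dets = [
        {"box": (21, 20, 31, 30), "feat": [0.0, 1.0]},
        {"box": (1, 0, 11, 10), "feat": [1.0, 0.0]},
    ]
    print(greedy_match(tracks, dets))
```

Greedy matching keeps the sketch dependency-free. A tracker would more commonly solve the assignment optimally, for example with the Hungarian algorithm, and would tune the fusion weight and cost threshold per dataset.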

Cite

Text

Fu et al. "Foundation Model Driven Appearance Extraction for Robust Multiple Object Tracking." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32311

Markdown

[Fu et al. "Foundation Model Driven Appearance Extraction for Robust Multiple Object Tracking." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/fu2025aaai-foundation/) doi:10.1609/AAAI.V39I3.32311

BibTeX

@inproceedings{fu2025aaai-foundation,
  title     = {{Foundation Model Driven Appearance Extraction for Robust Multiple Object Tracking}},
  author    = {Fu, Teng and Yu, Haiyang and Niu, Ke and Li, Bin and Xue, Xiangyang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {3031-3039},
  doi       = {10.1609/AAAI.V39I3.32311},
  url       = {https://mlanthology.org/aaai/2025/fu2025aaai-foundation/}
}