DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Abstract

Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video to precisely track a specified object. By leveraging high-level semantic information, VLT guides object tracking and alleviates the constraints of relying solely on the visual modality. Nevertheless, most VLT benchmarks are annotated at a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators to produce high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive, multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific, multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless integration into various visual tracking benchmarks. (2) We select three prominent benchmarks on which to deploy our approach, covering short-term tracking, long-term tracking, and global instance tracking. We offer four granularity combinations for these benchmarks, considering the extent and density of semantic information, thereby showcasing the practicality and versatility of DTLLM-VLT. (3) We conduct comparative experiments on VLT benchmarks with different text granularities, evaluating and analyzing the impact of diverse text on tracking performance. In conclusion, this work leverages an LLM to provide multi-granularity semantic information for the VLT task from efficient and diverse perspectives, enabling fine-grained evaluation of multi-modal trackers. In the future, we believe this work can be extended to more datasets to support the understanding of vision datasets.

Cite

Text

Li et al. "DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00724

Markdown

[Li et al. "DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/li2024cvprw-dtllmvlt/) doi:10.1109/CVPRW63382.2024.00724

BibTeX

@inproceedings{li2024cvprw-dtllmvlt,
  title     = {{DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM}},
  author    = {Li, Xuchen and Feng, Xiaokun and Hu, Shiyu and Wu, Meiqi and Zhang, Dailing and Zhang, Jing and Huang, Kaiqi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7283--7292},
  doi       = {10.1109/CVPRW63382.2024.00724},
  url       = {https://mlanthology.org/cvprw/2024/li2024cvprw-dtllmvlt/}
}