JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems

Abstract

Unmanned Aerial Vehicles (UAVs) are widely adopted across various fields, yet they raise significant privacy and safety concerns, demanding robust monitoring solutions. Existing anti-UAV methods primarily focus on position tracking but fail to capture UAV behavior and intent. To address this, we introduce a novel task, UAV Tracking and Intent Understanding (UTIU), which aims to track UAVs while inferring and describing their motion states and intent for a more comprehensive monitoring approach. To tackle this task, we propose JTD-UAV, the first joint tracking and intent description framework based on large language models. Our dual-branch architecture integrates UAV tracking with Visual Question Answering (VQA), allowing simultaneous localization and behavior description. To benchmark this task, we introduce the TDUAV dataset, the largest dataset for joint UAV tracking and intent understanding, featuring 1,328 challenging video sequences, over 163K annotated thermal frames, and 3K VQA pairs. Our benchmark demonstrates the effectiveness of JTD-UAV, and both the dataset and code will be made publicly available.

Cite

Text

Wang et al. "JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00160

Markdown

[Wang et al. "JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-jtduav/) doi:10.1109/CVPR52734.2025.00160

BibTeX

@inproceedings{wang2025cvpr-jtduav,
  title     = {{JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems}},
  author    = {Wang, Yifan and Zhao, Jian and Fan, Zhaoxin and Zhang, Xin and Wu, Xuecheng and Zhang, Yudian and Jin, Lei and Li, Xinyue and Wang, Gang and Jia, Mengxi and Hu, Ping and Zhu, Zheng and Li, Xuelong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {1633--1644},
  doi       = {10.1109/CVPR52734.2025.00160},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-jtduav/}
}