Unifying Visual and Vision-Language Tracking via Contrastive Learning

Abstract

Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between modalities, most existing trackers are designed for only one or a subset of these reference settings and overspecialize in a specific modality. In contrast, we present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which makes full use of the target reference to dynamically mine ever-changing scenario features from video contexts and distinguishes the target in a contrastive way, enabling robust performance across reference settings. Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Codes and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
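The multi-modal contrastive loss mentioned in the abstract can be sketched as a symmetric InfoNCE-style objective that pulls matched visual and language embeddings together in a shared space. The function name, temperature value, and NumPy formulation below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
import numpy as np

def contrastive_alignment_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched
    (visual, language) embedding pairs. Hypothetical sketch;
    shapes: visual (B, D), text (B, D)."""
    # L2-normalize both modalities so similarity is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature   # (B, B) pairwise similarity matrix
    labels = np.arange(len(v))       # matched pairs lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the visual-to-language and language-to-visual directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, well-separated embeddings the loss approaches zero; mismatched pairs drive it up, which is what pushes the two modalities into a unified semantic space.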

Cite

Text

Ma et al. "Unifying Visual and Vision-Language Tracking via Contrastive Learning." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I5.28205

Markdown

[Ma et al. "Unifying Visual and Vision-Language Tracking via Contrastive Learning." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/ma2024aaai-unifying/) doi:10.1609/AAAI.V38I5.28205

BibTeX

@inproceedings{ma2024aaai-unifying,
  title     = {{Unifying Visual and Vision-Language Tracking via Contrastive Learning}},
  author    = {Ma, Yinchao and Tang, Yuyang and Yang, Wenfei and Zhang, Tianzhu and Zhang, Jinpeng and Kang, Mengxue},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {4107--4116},
  doi       = {10.1609/AAAI.V38I5.28205},
  url       = {https://mlanthology.org/aaai/2024/ma2024aaai-unifying/}
}