Cross-Modal Target Retrieval for Tracking by Natural Language

Abstract

Tracking by natural language specification in a video is a challenging task in computer vision. Unlike initializing the target state with only a bounding box in the first frame, a language specification has strong potential to help visual object trackers capture appearance variation and resolve semantic ambiguity about the tracked object. In this paper, we carefully design a unified local-global-search framework from the perspective of cross-modal retrieval, consisting of a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module. The adaptive retrieval switch module aligns semantics from the visual signal and the lingual description of the target using three sub-modules, i.e., object-aware attention memory, part-aware cross-attention, and vision-language contrast, which together enable automatic switching between local search and global search. When the global search mechanism is triggered, the target-specific retrieval module re-localizes the missing target over the whole image via an efficient vision-language guided proposal selector and target-text matching. Extensive experimental results on three prevailing benchmarks show the effectiveness and generalization of our framework.

Cite

Text

Li et al. "Cross-Modal Target Retrieval for Tracking by Natural Language." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00540

Markdown

[Li et al. "Cross-Modal Target Retrieval for Tracking by Natural Language." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/li2022cvprw-crossmodal/) doi:10.1109/CVPRW56347.2022.00540

BibTeX

@inproceedings{li2022cvprw-crossmodal,
  title     = {{Cross-Modal Target Retrieval for Tracking by Natural Language}},
  author    = {Li, Yihao and Yu, Jun and Cai, Zhongpeng and Pan, Yuwen},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2022},
  pages     = {4927--4936},
  doi       = {10.1109/CVPRW56347.2022.00540},
  url       = {https://mlanthology.org/cvprw/2022/li2022cvprw-crossmodal/}
}