Cross-Modal Target Retrieval for Tracking by Natural Language
Abstract
Tracking by natural language specification in a video is a challenging task in computer vision. Unlike initializing the target state only with a bounding box in the first frame, a language specification has strong potential to help visual object trackers capture appearance variation and eliminate semantic ambiguity of the tracked object. In this paper, we carefully design a unified local-global-search framework from the perspective of cross-modal retrieval, comprising a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module. The adaptive retrieval switch module aligns semantics between the visual signal and the lingual description of the target using three sub-modules, i.e., object-aware attention memory, part-aware cross-attention, and vision-language contrast, which together enable an automatic switch between local search and global search. When the global search mechanism is triggered, the target-specific retrieval module re-localizes the missing target over the full image via an efficient vision-language guided proposal selector and target-text matching. Extensive experimental results on three prevailing benchmarks show the effectiveness and generalization of our framework.
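The abstract describes a control flow in which a local tracker runs by default, an adaptive switch scores vision-language agreement, and a global target-specific retrieval is triggered when that agreement is low. The following is a minimal, highly simplified sketch of that decision logic only; all module interfaces (`local_tracker`, `switch_confidence`, `global_retrieval`) and the threshold are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

@dataclass
class LanguageGuidedTracker:
    """Sketch of the local-global-search control flow; interfaces are hypothetical."""
    # Local tracker: given a frame and the previous box, returns a new box and a score.
    local_tracker: Callable[[object, Box], Tuple[Box, float]]
    # Adaptive retrieval switch: scores agreement between the local result and the text.
    switch_confidence: Callable[[object, Box, str], float]
    # Target-specific retrieval: re-localizes the target over the whole image.
    global_retrieval: Callable[[object, str], Box]
    threshold: float = 0.5  # assumed cutoff for switching to global search

    def track(self, frame: object, prev_box: Box, text: str) -> Box:
        box, _ = self.local_tracker(frame, prev_box)
        # High vision-language agreement: keep the local-search result.
        if self.switch_confidence(frame, box, text) >= self.threshold:
            return box
        # Low agreement suggests drift or loss: fall back to image-wide retrieval.
        return self.global_retrieval(frame, text)
```

In this sketch the switch is a single scalar score; in the paper it is built from three sub-modules (object-aware attention memory, part-aware cross-attention, vision-language contrast).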
Cite
Text
Li et al. "Cross-Modal Target Retrieval for Tracking by Natural Language." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00540

Markdown
[Li et al. "Cross-Modal Target Retrieval for Tracking by Natural Language." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/li2022cvprw-crossmodal/) doi:10.1109/CVPRW56347.2022.00540

BibTeX
@inproceedings{li2022cvprw-crossmodal,
title = {{Cross-Modal Target Retrieval for Tracking by Natural Language}},
author = {Li, Yihao and Yu, Jun and Cai, Zhongpeng and Pan, Yuwen},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2022},
  pages = {4927--4936},
doi = {10.1109/CVPRW56347.2022.00540},
url = {https://mlanthology.org/cvprw/2022/li2022cvprw-crossmodal/}
}