UATVR: Uncertainty-Adaptive Text-Video Retrieval

Abstract

With the explosive growth of web videos and the emergence of large-scale vision-language pre-training models such as CLIP, retrieving videos of interest from text queries has attracted increasing attention. A common practice is to map text-video pairs into a shared embedding space and to design cross-modal interactions between selected entities at specific granularities to establish semantic correspondence. Unfortunately, the intrinsic uncertainty about which entity combinations, at which granularities, are optimal for a cross-modal query is understudied, and this is especially critical for modalities with hierarchical semantics such as video and text. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we introduce additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions from which prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks demonstrate the superiority of UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
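To make the idea of matching by sampled prototypes concrete, the following is a minimal sketch and not the authors' implementation. It assumes each text and each video is summarized by a diagonal Gaussian (a mean and a log-variance vector), draws prototypes with the reparameterization trick, and scores a pair by the average cosine similarity over all prototype pairs. The function names, the Gaussian form, the averaging rule, and the default of 7 samples are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def sample_prototypes(mean, log_var, num_samples, rng):
    """Draw prototypes from a diagonal Gaussian (assumed form) via reparameterization."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((num_samples,) + mean.shape)
    return mean + eps * std  # shape: (num_samples, dim)


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def distribution_similarity(text_mean, text_log_var,
                            video_mean, video_log_var,
                            num_samples=7, seed=0):
    """Average cosine similarity over all sampled text/video prototype pairs.

    Hypothetical scoring rule chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    text_protos = l2_normalize(
        sample_prototypes(text_mean, text_log_var, num_samples, rng))
    video_protos = l2_normalize(
        sample_prototypes(video_mean, video_log_var, num_samples, rng))
    pairwise = text_protos @ video_protos.T  # (num_samples, num_samples)
    return float(pairwise.mean())


if __name__ == "__main__":
    dim = 16
    rng = np.random.default_rng(42)
    text_mean = rng.standard_normal(dim)
    video_mean = text_mean + 0.1 * rng.standard_normal(dim)
    log_var = np.full(dim, -4.0)  # small assumed variance
    print(distribution_similarity(text_mean, log_var, video_mean, log_var))
```

When the variances shrink toward zero, this score reduces to the ordinary cosine similarity between the two means, so point-embedding matching is the limiting case of the sketch.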

Cite

Text

Fang et al. "UATVR: Uncertainty-Adaptive Text-Video Retrieval." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01262

Markdown

[Fang et al. "UATVR: Uncertainty-Adaptive Text-Video Retrieval." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/fang2023iccv-uatvr/) doi:10.1109/ICCV51070.2023.01262

BibTeX

@inproceedings{fang2023iccv-uatvr,
  title     = {{UATVR: Uncertainty-Adaptive Text-Video Retrieval}},
  author    = {Fang, Bo and Wu, Wenhao and Liu, Chang and Zhou, Yu and Song, Yuxin and Wang, Weiping and Shu, Xiangbo and Ji, Xiangyang and Wang, Jingdong},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {13723-13733},
  doi       = {10.1109/ICCV51070.2023.01262},
  url       = {https://mlanthology.org/iccv/2023/fang2023iccv-uatvr/}
}