On Semantic Similarity in Video Retrieval

Abstract

Current video retrieval efforts all found their evaluation on an instance-based assumption: only a single caption is relevant to a query video, and vice versa. We demonstrate that this assumption leads to performance comparisons that are often not indicative of models' retrieval capabilities. We propose a move to semantic similarity video retrieval, where (i) multiple videos/captions can be deemed equally relevant, and their relative ranking does not affect a method's reported performance, and (ii) retrieved videos/captions are ranked by their similarity to a query. We propose several proxies to estimate semantic similarities in large-scale retrieval datasets without additional annotations. Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2, and EPIC-KITCHENS).
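To make the proposed evaluation concrete, below is a minimal sketch of how semantic-similarity retrieval can be scored. It assumes a simple bag-of-words proxy (IoU of word sets) for caption similarity and scores a ranked list with nDCG, so that equally relevant items earn credit by similarity rather than by a single "correct" instance; the function names, toy captions, and this specific proxy are illustrative assumptions, not the authors' released code.

# Hedged sketch of semantic-similarity evaluation for caption-to-video
# retrieval. The bag-of-words proxy and use of nDCG follow the paper's
# proposal in spirit; the toy data and names below are assumptions.
import math

def bow_similarity(caption_a: str, caption_b: str) -> float:
    """Bag-of-words proxy: IoU of the two captions' word sets."""
    a, b = set(caption_a.lower().split()), set(caption_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def ndcg(relevances: list[float]) -> float:
    """nDCG of a ranked list of graded relevances (higher is better)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy example: a query caption and the captions of the videos a model
# returned, in ranked order. Two results describe the same action, so
# both should count as relevant rather than only one paired instance.
query = "a person slices an onion"
ranked_captions = [
    "someone slices an onion on a board",   # semantically relevant
    "a person chops an onion",              # also relevant
    "a dog runs through a park",            # irrelevant
]
rels = [bow_similarity(query, c) for c in ranked_captions]
print(f"graded relevances: {[round(r, 2) for r in rels]}")
print(f"nDCG: {ndcg(rels):.3f}")  # 1.0 iff ranking matches similarity order

Under the instance-based assumption, only the one ground-truth caption would count as correct; here both onion-slicing results contribute graded relevance, and swapping two equally relevant results leaves the score unchanged, matching point (i) above.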

Cite

Text

Wray et al. "On Semantic Similarity in Video Retrieval." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00365

Markdown

[Wray et al. "On Semantic Similarity in Video Retrieval." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/wray2021cvpr-semantic/) doi:10.1109/CVPR46437.2021.00365

BibTeX

@inproceedings{wray2021cvpr-semantic,
  title     = {{On Semantic Similarity in Video Retrieval}},
  author    = {Wray, Michael and Doughty, Hazel and Damen, Dima},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {3650--3660},
  doi       = {10.1109/CVPR46437.2021.00365},
  url       = {https://mlanthology.org/cvpr/2021/wray2021cvpr-semantic/}
}