VTC: Improving Video-Text Retrieval with User Comments

Abstract

Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, the current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there is currently no multi-modal representation learning dataset that includes comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that, by using comments, our method is able to learn better, more contextualised representations for image, video and audio.
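
As a rough illustration of point (b), the sketch below shows one way such a comment-attention mechanism could look in PyTorch: the video embedding acts as the query over per-comment embeddings, so comments unrelated to the video receive low attention weights before being fused into the representation. The class name, dimensions, and fusion step are illustrative assumptions, not the authors' actual architecture.

# Minimal sketch (not the paper's implementation): attention pooling over
# comment embeddings, with the video embedding as the query so that
# irrelevant comments receive low weights. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommentAttentionPool(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, video_emb: torch.Tensor, comment_embs: torch.Tensor) -> torch.Tensor:
        # video_emb: (batch, dim); comment_embs: (batch, num_comments, dim)
        q = self.query_proj(video_emb).unsqueeze(1)                      # (batch, 1, dim)
        k = self.key_proj(comment_embs)                                  # (batch, num_comments, dim)
        attn = F.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (batch, 1, num_comments)
        pooled = (attn @ comment_embs).squeeze(1)                        # (batch, dim)
        # Fuse the weighted comment context into the video representation.
        return F.normalize(video_emb + pooled, dim=-1)

# Usage: fused = CommentAttentionPool(512)(video_emb, comment_embs)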

Cite

Text

Hanu et al. "VTC: Improving Video-Text Retrieval with User Comments." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19833-5_36

Markdown

[Hanu et al. "VTC: Improving Video-Text Retrieval with User Comments." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/hanu2022eccv-vtc/) doi:10.1007/978-3-031-19833-5_36

BibTeX

@inproceedings{hanu2022eccv-vtc,
  title     = {{VTC: Improving Video-Text Retrieval with User Comments}},
  author    = {Hanu, Laura and Thewlis, James and Asano, Yuki M. and Rupprecht, Christian},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19833-5_36},
  url       = {https://mlanthology.org/eccv/2022/hanu2022eccv-vtc/}
}