Visual Consensus Modeling for Video-Text Retrieval

Abstract

In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from existing works, which learn the video and text representations and their complicated relationships solely from pairwise video-text data, we make the first attempt to model the visual consensus by mining visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities, without relying on any additional concept annotations. Specifically, we build a shareable and learnable graph as the visual consensus, where the nodes denote the mined visual concepts and the edges connecting the nodes represent the co-occurrence relationships between them. Extensive experimental results on public benchmark datasets demonstrate that our proposed method, with its ability to effectively model the visual consensus, achieves state-of-the-art performance on the bidirectional video-text retrieval task. Our code is available at https://github.com/sqiangcao99/VCM.
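The abstract describes the consensus as a shareable, learnable graph whose nodes are mined visual concepts and whose edges encode concept co-occurrence. Below is a minimal PyTorch sketch of such a structure, not the authors' implementation: the module name `ConsensusGraph`, the concept count, the embedding dimension, and the single propagation step are all illustrative assumptions.

```python
# Hypothetical sketch of a shareable, learnable concept graph: node embeddings
# for K mined visual concepts plus a learnable co-occurrence adjacency matrix,
# with one propagation step to contextualize concept features that both the
# video and text branches could attend over.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsensusGraph(nn.Module):
    def __init__(self, num_concepts: int = 512, dim: int = 256):
        super().__init__()
        # One learnable embedding per mined visual concept (sizes are assumed).
        self.node_emb = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)
        # Learnable (soft) co-occurrence relationships between concepts.
        self.adj_logits = nn.Parameter(torch.zeros(num_concepts, num_concepts))
        self.proj = nn.Linear(dim, dim)

    def forward(self) -> torch.Tensor:
        # Normalize edge weights row-wise, then propagate node features once.
        adj = F.softmax(self.adj_logits, dim=-1)
        nodes = self.proj(adj @ self.node_emb)
        # Residual connection keeps the raw concept embeddings accessible.
        return F.relu(nodes) + self.node_emb

# Usage: the same ConsensusGraph instance would be shared by both modality
# encoders, e.g. features = ConsensusGraph()()  # (num_concepts, dim)
```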

Cite

Text

Cao et al. "Visual Consensus Modeling for Video-Text Retrieval." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I1.19891

Markdown

[Cao et al. "Visual Consensus Modeling for Video-Text Retrieval." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/cao2022aaai-visual/) doi:10.1609/AAAI.V36I1.19891

BibTeX

@inproceedings{cao2022aaai-visual,
  title     = {{Visual Consensus Modeling for Video-Text Retrieval}},
  author    = {Cao, Shuqiang and Wang, Bairui and Zhang, Wei and Ma, Lin},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {167--175},
  doi       = {10.1609/AAAI.V36I1.19891},
  url       = {https://mlanthology.org/aaai/2022/cao2022aaai-visual/}
}