Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning

Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. The model disentangles text into a hierarchical semantic graph with three levels (events, actions, and entities) and generates hierarchical textual embeddings via attention-based graph reasoning. The different text levels guide the learning of diverse, hierarchical video representations for cross-modal matching, capturing both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences. Code will be released at https://github.com/cshizhe/hgr_v2t.
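To make the global-to-local matching idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the text and the video have already been embedded at three levels (one event vector, a set of action vectors, a set of entity vectors) in a shared space. The function names, the dictionary layout, the max-over-frames alignment and the plain averaging of the three level scores are all illustrative assumptions; the abstract states only that matching is decomposed into levels and that text embeddings come from attention-based graph reasoning, which this sketch does not model.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_similarity(text_vec, video_vec):
    """Cosine similarity between one text vector and one video vector."""
    return float(l2_normalize(text_vec) @ l2_normalize(video_vec))


def local_similarity(text_nodes, video_frames):
    """Align each text node to its best-matching video frame, then average.

    text_nodes:   array of shape (num_nodes, dim)
    video_frames: array of shape (num_frames, dim)
    The max-over-frames alignment is an assumption made for illustration.
    """
    sims = l2_normalize(text_nodes) @ l2_normalize(video_frames).T
    return float(sims.max(axis=1).mean())


def hierarchical_similarity(text, video):
    """Average the event-, action- and entity-level scores (hypothetical aggregation)."""
    scores = [
        global_similarity(text["event"], video["event"]),
        local_similarity(text["actions"], video["actions"]),
        local_similarity(text["entities"], video["entities"]),
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    # Random stand-ins for embeddings that trained encoders would produce.
    text = {
        "event": rng.normal(size=dim),
        "actions": rng.normal(size=(2, dim)),
        "entities": rng.normal(size=(4, dim)),
    }
    video = {
        "event": rng.normal(size=dim),
        "actions": rng.normal(size=(10, dim)),
        "entities": rng.normal(size=(10, dim)),
    }
    print(hierarchical_similarity(text, video))
```

In a retrieval setting, a score of this kind would be computed for every candidate video given a query text and the candidates ranked by it. Keeping the per-level scores separate is what lets the local levels penalize a video that matches the overall event but not a specific action or entity.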

Cite

Text

Chen et al. "Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01065

Markdown

[Chen et al. "Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/chen2020cvpr-finegrained/) doi:10.1109/CVPR42600.2020.01065

BibTeX

@inproceedings{chen2020cvpr-finegrained,
  title     = {{Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning}},
  author    = {Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01065},
  url       = {https://mlanthology.org/cvpr/2020/chen2020cvpr-finegrained/}
}