Fast Video Moment Retrieval

Abstract

This paper targets fast video moment retrieval (fast VMR), which aims to localize the moment described by a given natural language query both efficiently and accurately. We argue that most existing VMR approaches can be divided into three modules, namely a video encoder, a text encoder, and a cross-modal interaction module, and that the last module is the test-time computational bottleneck. To tackle this issue, we replace the cross-modal interaction module with a cross-modal common space, in which moment-query alignment is learned and efficient moment search can be performed. To make the learned space robust, we propose a fine-grained semantic distillation framework that transfers knowledge from additional semantic structures. Specifically, we build a semantic role tree that decomposes a query sentence into different phrases (subtrees). A hierarchical semantic-guided attention module is designed to propagate messages across the whole tree and yield discriminative features. Finally, the important and discriminative semantics are transferred to the common space through a matching-score distillation process. Extensive experiments on three popular VMR benchmarks demonstrate that our method achieves both high retrieval speed and strong accuracy.
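The efficiency argument in the abstract can be made concrete with a minimal sketch: once moments and queries live in a shared space, test-time retrieval reduces to one matrix-vector similarity ranking, and during training the cheap student scores can be pushed toward a heavier teacher branch. All names, dimensions, and the choice of KL divergence below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Assumed common-space size and number of candidate moments.
d, num_moments = 256, 1000

# --- Common-space retrieval (fast test-time path) -------------------
# Moment features can be projected and cached offline; a new query then
# needs only one encoder pass plus a single similarity ranking, with no
# per-moment cross-modal interaction.
moment_feats = F.normalize(torch.randn(num_moments, d), dim=-1)
query_feat = F.normalize(torch.randn(d), dim=-1)
scores = moment_feats @ query_feat          # (num_moments,) cosine scores
best = scores.argmax().item()
print(f"retrieved moment index: {best}, score: {scores[best]:.3f}")

# --- Matching-score distillation (training only) --------------------
# A heavier teacher branch built on the semantic role tree (here a
# random stand-in) produces fine-grained matching scores; the common
# space is trained so its score distribution mimics the teacher's.
teacher_scores = torch.randn(num_moments)   # stand-in for tree branch
distill_loss = F.kl_div(
    F.log_softmax(scores, dim=-1),
    F.softmax(teacher_scores, dim=-1),
    reduction="sum",
)
```

Because the moment side is precomputable, the per-query cost is dominated by a single text-encoder pass and a dot product over all candidates, which is what makes the common-space design fast at test time.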

Cite

Text

Gao and Xu. "Fast Video Moment Retrieval." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00155

Markdown

[Gao and Xu. "Fast Video Moment Retrieval." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/gao2021iccv-fast/) doi:10.1109/ICCV48922.2021.00155

BibTeX

@inproceedings{gao2021iccv-fast,
  title     = {{Fast Video Moment Retrieval}},
  author    = {Gao, Junyu and Xu, Changsheng},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {1523--1532},
  doi       = {10.1109/ICCV48922.2021.00155},
  url       = {https://mlanthology.org/iccv/2021/gao2021iccv-fast/}
}