Structured Multi-Level Interaction Network for Video Moment Localization via Language Query

Abstract

We address the problem of localizing, in a video, the specific moment described by a natural language query. Existing works let the query interact with either video frames or moment proposals, and neglect the inherent structure of moment construction, which matters for both cross-modal understanding and video content comprehension, the two crucial challenges of this task. In this paper, we disentangle the activity moment into boundary and content. Based on this moment structure, we propose a novel Structured Multi-level Interaction Network (SMIN) that tackles the problem through multiple levels of cross-modal interaction coupled with content-boundary-moment interaction. In particular, for cross-modal interaction, the sentence-level query interacts with the whole moment while the word-level query interacts with content and boundary, in a coarse-to-fine manner. For content-boundary-moment interaction, we capture the relations among the boundary, the content, and the whole moment proposal. Through these multi-level interactions, the model obtains a robust cross-modal representation for accurate moment localization. Extensive experiments on three benchmarks (Charades-STA, ActivityNet-Captions, and TACoS) demonstrate that the proposed approach outperforms state-of-the-art methods.
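
To make the boundary/content decomposition concrete, the following is a minimal sketch and not the authors' implementation. It assumes a video is given as a list of per-frame feature vectors and a moment proposal as an inclusive pair of frame indices. The function names (`mean_pool`, `decompose_moment`), the use of the first and last frames as the boundary, and the mean pooling of interior frames as the content are all illustrative assumptions; the abstract does not specify how SMIN builds these representations.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def decompose_moment(
    frames: Sequence[Vector], start: int, end: int
) -> Tuple[Vector, Vector, Vector]:
    """Split the proposal [start, end] (inclusive) into boundary and content.

    Hypothetical choice: the boundary is the pair of start/end frame features,
    and the content is the mean of the frames strictly between them.
    """
    if not 0 <= start <= end < len(frames):
        raise ValueError("proposal indices must satisfy 0 <= start <= end < len(frames)")
    start_feat = list(frames[start])
    end_feat = list(frames[end])
    interior = frames[start + 1:end]
    # With no interior frames, fall back to averaging the two boundary frames.
    content_feat = mean_pool(interior) if interior else mean_pool([frames[start], frames[end]])
    return start_feat, end_feat, content_feat


# Toy video: five frames with 2-dimensional features.
frames = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
start_feat, end_feat, content_feat = decompose_moment(frames, 0, 4)
```

Under the coarse-to-fine scheme described above, the word-level query would interact with features like `start_feat`, `end_feat`, and `content_feat`, while the sentence-level query would interact with a representation of the whole proposal; how those interactions are computed is left to the paper.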

Cite

Text

Wang et al. "Structured Multi-Level Interaction Network for Video Moment Localization via Language Query." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00695

Markdown

[Wang et al. "Structured Multi-Level Interaction Network for Video Moment Localization via Language Query." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/wang2021cvpr-structured/) doi:10.1109/CVPR46437.2021.00695

BibTeX

@inproceedings{wang2021cvpr-structured,
  title     = {{Structured Multi-Level Interaction Network for Video Moment Localization via Language Query}},
  author    = {Wang, Hao and Zha, Zheng-Jun and Li, Liang and Liu, Dong and Luo, Jiebo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {7026-7035},
  doi       = {10.1109/CVPR46437.2021.00695},
  url       = {https://mlanthology.org/cvpr/2021/wang2021cvpr-structured/}
}