Localizing Natural Language in Videos

Abstract

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize a segment in the video that semantically corresponds to the given description. We propose a localizing network (LNet), working in an end-to-end fashion, to tackle the NLVL task. We first match the natural sentence and the video sequence with cross-gated attended recurrent networks to exploit their fine-grained interactions and generate a sentence-aware video representation. A self-interactor is then proposed to perform cross-frame matching, dynamically encoding and aggregating the matching evidence. Finally, a boundary model locates the video segment corresponding to the sentence description by predicting the starting and ending points of the segment. Extensive experiments on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently compared with state-of-the-art approaches.
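
The abstract names three components: cross-gated sentence-video matching, a self-interactor, and a boundary model. As a rough, minimal PyTorch sketch (not the authors' released code), the snippet below illustrates two of those ideas under assumed shapes and names: a cross-gate in which each modality emits a sigmoid gate that modulates the other modality's features, and boundary heads that score every frame as a candidate start or end, with the segment chosen as the most probable (start, end) pair. All module and variable names (CrossGate, BoundaryHeads, match_feats, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossGate(nn.Module):
    """Assumed form of cross-gating: each modality gates the other's features."""
    def __init__(self, dim):
        super().__init__()
        self.gate_v = nn.Linear(dim, dim)  # sentence -> gate on video features
        self.gate_s = nn.Linear(dim, dim)  # video    -> gate on sentence features

    def forward(self, video_feats, sent_feat):
        # video_feats: (batch, T, dim) per-frame features; sent_feat: (batch, dim)
        g_v = torch.sigmoid(self.gate_v(sent_feat)).unsqueeze(1)    # (batch, 1, dim)
        g_s = torch.sigmoid(self.gate_s(video_feats.mean(dim=1)))   # (batch, dim)
        # Sentence-aware video features and video-aware sentence features.
        return video_feats * g_v, sent_feat * g_s

class BoundaryHeads(nn.Module):
    """Scores every time step as a candidate start or end of the segment."""
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, match_feats):
        # match_feats: (batch, T, dim) aggregated cross-frame matching evidence
        return self.start(match_feats).squeeze(-1), self.end(match_feats).squeeze(-1)

def predict_segment(start_logits, end_logits):
    # Choose the (start, end) pair maximizing p(start) * p(end) with start <= end.
    joint = start_logits.softmax(-1).unsqueeze(-1) * end_logits.softmax(-1).unsqueeze(-2)
    joint = joint.triu()                    # zero out pairs with end < start
    T = joint.size(-1)
    idx = joint.flatten(-2).argmax(-1)      # flat index of the best (start, end) pair
    return idx // T, idx % T                # start and end frame indices
```

Selecting the jointly most probable pair with start <= end, rather than taking independent argmaxes of the two heads, guarantees a valid segment; this particular selection rule is an assumption of the sketch, not a detail taken from the paper.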

Cite

Text

Chen et al. "Localizing Natural Language in Videos." AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/AAAI.V33I01.33018175

Markdown

[Chen et al. "Localizing Natural Language in Videos." AAAI Conference on Artificial Intelligence, 2019.](https://mlanthology.org/aaai/2019/chen2019aaai-localizing/) doi:10.1609/AAAI.V33I01.33018175

BibTeX

@inproceedings{chen2019aaai-localizing,
  title     = {{Localizing Natural Language in Videos}},
  author    = {Chen, Jingyuan and Ma, Lin and Chen, Xinpeng and Jie, Zequn and Luo, Jiebo},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2019},
  pages     = {8175--8182},
  doi       = {10.1609/AAAI.V33I01.33018175},
  url       = {https://mlanthology.org/aaai/2019/chen2019aaai-localizing/}
}