Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies either slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a presegmented video, and both strategies inevitably suffer from having to enumerate candidates exhaustively. To alleviate this problem, we formulate the task as a sequential decision-making problem and learn an agent that progressively adjusts the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement-learning-based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is considered during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
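The sequential formulation described in the abstract can be pictured with a small sketch. It is illustrative only and is not the paper's implementation: the action names, the fixed step size `delta`, the centered initial window, and the greedy oracle that stands in for the learned policy are all assumptions made here. Only the idea of progressively moving a temporal window, and the budget of at most 10 steps, come from the abstract.

```python
from typing import Callable, Tuple

Window = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(a: Window, b: Window) -> float:
    """Intersection over union of two temporal windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical action set; the real action space may differ.
ACTIONS = ("shift_left", "shift_right", "expand", "shrink", "stop")

def apply_action(window: Window, action: str, delta: float, duration: float) -> Window:
    """Move the window boundaries by delta, clamped to [0, duration]."""
    s, e = window
    if action == "shift_left":
        s, e = s - delta, e - delta
    elif action == "shift_right":
        s, e = s + delta, e + delta
    elif action == "expand":
        s, e = s - delta, e + delta
    elif action == "shrink":
        s, e = s + delta, e - delta
    s, e = max(0.0, s), min(duration, e)
    # Reject moves that would collapse the window.
    return (s, e) if e > s else window

def make_oracle_policy(target: Window, delta: float, duration: float) -> Callable[[Window], str]:
    """Stand-in for a learned policy: greedily pick the action that most
    improves IoU with a known target, or 'stop' if none strictly improves."""
    def policy(window: Window) -> str:
        best_action, best_iou = "stop", temporal_iou(window, target)
        for action in ACTIONS[:-1]:
            iou = temporal_iou(apply_action(window, action, delta, duration), target)
            if iou > best_iou:
                best_action, best_iou = action, iou
        return best_action
    return policy

def ground(policy: Callable[[Window], str], duration: float,
           delta: float, max_steps: int = 10) -> Tuple[Window, int]:
    """Adjust the window step by step until the policy stops or the budget runs out."""
    window = (0.25 * duration, 0.75 * duration)  # assumed initial window
    steps = 0
    while steps < max_steps:
        action = policy(window)
        if action == "stop":
            break
        window = apply_action(window, action, delta, duration)
        steps += 1
    return window, steps

duration, delta, target = 100.0, 10.0, (60.0, 90.0)
policy = make_oracle_policy(target, delta, duration)
final_window, steps_used = ground(policy, duration, delta)
print(final_window, steps_used, round(temporal_iou(final_window, target), 3))
```

In a learned setting the oracle would be replaced by a policy that conditions on the sentence and the video features of the current window, since the target segment is unknown at test time. The point of the sketch is only that the number of windows inspected is bounded by `max_steps` rather than by the number of candidate segments.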

Cite

Text

He et al. "Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos." AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/AAAI.V33I01.33018393

Markdown

[He et al. "Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos." AAAI Conference on Artificial Intelligence, 2019.](https://mlanthology.org/aaai/2019/he2019aaai-read/) doi:10.1609/AAAI.V33I01.33018393

BibTeX

@inproceedings{he2019aaai-read,
  title     = {{Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos}},
  author    = {He, Dongliang and Zhao, Xiang and Huang, Jizhou and Li, Fu and Liu, Xiao and Wen, Shilei},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2019},
  pages     = {8393--8400},
  doi       = {10.1609/AAAI.V33I01.33018393},
  url       = {https://mlanthology.org/aaai/2019/he2019aaai-read/}
}