Local-Global Video-Text Interactions for Temporal Grounding

Mun, Jonghwan; Cho, Minsu; Han, Bohyung

doi:10.1109/CVPR42600.2020.01082

CVPR 2020

Abstract

This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and that reflects bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins, 7.44 and 4.61 percentage points on the Recall@tIoU=0.5 metric, respectively.
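For readers unfamiliar with the Recall@tIoU=0.5 metric mentioned in the abstract, the following is a minimal sketch of how such a score is commonly computed: the fraction of queries whose predicted interval overlaps the ground-truth interval with a temporal intersection-over-union of at least the threshold. The function names `temporal_iou` and `recall_at_tiou`, the (start, end) tuple layout, and the use of `>=` rather than `>` at the threshold are assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; layout is an assumption


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal intersection-over-union of two time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is clamped at zero for disjoint intervals.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_tiou(
    preds: Sequence[Interval], gts: Sequence[Interval], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single prediction reaches the tIoU threshold."""
    if not preds:
        return 0.0
    # Assumption: a prediction counts as a hit when tIoU >= threshold.
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    ground_truths = [(0.0, 10.0), (20.0, 30.0)]
    print(recall_at_tiou(predictions, ground_truths, threshold=0.5))
```

Under this definition, a gain of 7.44 points at tIoU=0.5 means that about 7.44% more of the test queries received a prediction overlapping the ground truth by at least half.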

Cite

Text

Mun et al. "Local-Global Video-Text Interactions for Temporal Grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01082

Markdown

[Mun et al. "Local-Global Video-Text Interactions for Temporal Grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/mun2020cvpr-localglobal/) doi:10.1109/CVPR42600.2020.01082

BibTeX

@inproceedings{mun2020cvpr-localglobal,
  title     = {{Local-Global Video-Text Interactions for Temporal Grounding}},
  author    = {Mun, Jonghwan and Cho, Minsu and Han, Bohyung},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01082},
  url       = {https://mlanthology.org/cvpr/2020/mun2020cvpr-localglobal/}
}