Knowing Where to Focus: Event-Aware Transformer for Video Grounding
Abstract
Recent DETR-based video grounding models learn moment queries to predict moment timestamps directly, without hand-crafted components such as pre-defined proposals or non-maximum suppression. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and the video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, which outperform state-of-the-art approaches on several video grounding benchmarks. The code is publicly available at https://github.com/jinhyunj/EaTR.
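The two reasoning steps described above, event reasoning via slot attention and moment reasoning via gated sentence fusion, can be illustrated with a short PyTorch sketch. The module names, feature dimensions, and the simplified gating (a sigmoid gate rather than a full gated fusion transformer layer) are illustrative assumptions, not the authors' EaTR implementation; see the linked repository for the actual code.

```python
# Minimal sketch (NOT the authors' EaTR implementation) of:
# (1) slot attention that groups clip features into event "slots", and
# (2) a gated fusion step that injects a sentence feature into the moment queries.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventSlotAttention(nn.Module):
    """Groups video clip features into a fixed number of event slots (Locatello-style slot attention)."""

    def __init__(self, dim: int, num_slots: int = 10, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_clips, dim)
        b, n, d = video_feats.shape
        inputs = self.norm_in(video_feats)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_init.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis so each clip is softly assigned to one event.
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots  # (batch, num_slots, dim): input-specific event queries


class GatedSentenceFusion(nn.Module):
    """Fuses a pooled sentence feature into each moment query through a learned gate (simplified)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, queries: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # queries: (batch, num_slots, dim), sentence: (batch, dim)
        s = sentence.unsqueeze(1).expand_as(queries)
        joint = torch.cat([queries, s], dim=-1)
        return torch.sigmoid(self.gate(joint)) * torch.tanh(self.proj(joint)) + queries


if __name__ == "__main__":
    clips = torch.randn(2, 75, 256)   # toy clip features
    sent = torch.randn(2, 256)        # toy pooled sentence feature
    events = EventSlotAttention(256)(clips)
    moment_queries = GatedSentenceFusion(256)(events, sent)
    print(moment_queries.shape)       # torch.Size([2, 10, 256])
```

The resulting moment queries would then serve as the decoder queries that interact with the video-sentence representations to predict moment timestamps.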
Cite
Text
Jang et al. "Knowing Where to Focus: Event-Aware Transformer for Video Grounding." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01273Markdown
[Jang et al. "Knowing Where to Focus: Event-Aware Transformer for Video Grounding." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/jang2023iccv-knowing/) doi:10.1109/ICCV51070.2023.01273BibTeX
@inproceedings{jang2023iccv-knowing,
title = {{Knowing Where to Focus: Event-Aware Transformer for Video Grounding}},
author = {Jang, Jinhyun and Park, Jungin and Kim, Jin and Kwon, Hyeongjun and Sohn, Kwanghoon},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {13846--13856},
doi = {10.1109/ICCV51070.2023.01273},
url = {https://mlanthology.org/iccv/2023/jang2023iccv-knowing/}
}