LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval
Abstract
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to understand relationships between the vision and language data, but these mechanisms lack contextual information between video frames that can be useful for determining how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 5-20% over prior weakly-supervised methods, even boasting an 11% gain over strongly-supervised methods on DiDeMo, while also using significantly fewer model parameters than other co-attention mechanisms.
Cite
Text
Tan et al. "LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval." Winter Conference on Applications of Computer Vision, 2021.
Markdown
[Tan et al. "LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/tan2021wacv-logan/)
BibTeX
@inproceedings{tan2021wacv-logan,
title = {{LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval}},
author = {Tan, Reuben and Xu, Huijuan and Saenko, Kate and Plummer, Bryan A.},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2021},
pages = {2083-2092},
url = {https://mlanthology.org/wacv/2021/tan2021wacv-logan/}
}