LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

Abstract

The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to model relationships between visual and language data, but these mechanisms lack the contextual information between video frames that can help determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 5-20% over prior weakly-supervised methods, even boasting an 11% gain over strongly-supervised methods on DiDeMo, while using significantly fewer model parameters than other co-attention mechanisms.
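For intuition, here is a minimal PyTorch sketch of the two ingredients the abstract describes: frame-by-word co-attention, and joint reasoning over all pairs of frames. All module names, dimensions, and design choices below (scaled dot-product attention, a learned soft adjacency over frame pairs, a residual update) are illustrative assumptions for exposition, not the authors' actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWordCoAttention(nn.Module):
    # Illustrative sketch: attends each frame over all words and each word
    # over all frames, yielding language-aware frame features and vice versa.
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, frames, words):
        # frames: (B, T, D), words: (B, N, D)
        sim = torch.einsum('btd,bnd->btn', frames, words) * self.scale   # (B, T, N)
        frames_ctx = torch.softmax(sim, dim=-1) @ words                  # (B, T, D)
        words_ctx = torch.softmax(sim.transpose(1, 2), dim=-1) @ frames  # (B, N, D)
        return frames_ctx, words_ctx

class PairwiseFrameGraph(nn.Module):
    # Illustrative sketch: reasons over all T x T frame pairs via a learned
    # soft adjacency, giving each frame context cues from every other frame.
    def __init__(self, dim):
        super().__init__()
        self.query, self.key = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, frames):
        # frames: (B, T, D); adj: (B, T, T) soft adjacency over frame pairs
        adj = torch.softmax(self.query(frames) @ self.key(frames).transpose(1, 2)
                            / frames.size(-1) ** 0.5, dim=-1)
        return F.relu(frames + self.update(adj @ frames))  # (B, T, D)

# Toy usage: 8 frames, 12 query words, 256-d features.
B, T, N, D = 2, 8, 12, 256
frames, words = torch.randn(B, T, D), torch.randn(B, N, D)
frames_ctx, _ = FrameWordCoAttention(D)(frames, words)
contextual = PairwiseFrameGraph(D)(frames_ctx)  # context-aware frame features

In this toy pipeline, co-attention first grounds each frame in the query words, and the pairwise graph then propagates context between frames so that segment-query relevance can be scored with awareness of the surrounding video.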

Cite

Text

Tan et al. "LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval." Winter Conference on Applications of Computer Vision, 2021.

Markdown

[Tan et al. "LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/tan2021wacv-logan/)

BibTeX

@inproceedings{tan2021wacv-logan,
  title     = {{LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval}},
  author    = {Tan, Reuben and Xu, Huijuan and Saenko, Kate and Plummer, Bryan A.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2021},
  pages     = {2083--2092},
  url       = {https://mlanthology.org/wacv/2021/tan2021wacv-logan/}
}