Single-Stage Visual Query Localization in Egocentric Videos
Abstract
Visual Query Localization (VQL) on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital for building episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline results in slow inference speeds. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single-shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame, and frame-to-frame correspondences between nearby video frames. Our experiments demonstrate that our approach outperforms prior VQL methods by $20\%$ accuracy while obtaining a $10\times$ improvement in inference speed. VQLoC is also the top entry on the Ego4D VQ2D challenge leaderboard.
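Below is a minimal, self-contained PyTorch sketch of the single-stage idea described in the abstract: relate the visual query to every frame (query-to-frame correspondence), propagate that evidence across frames (frame-to-frame correspondence), and predict a per-frame box and occurrence score in one pass. All module names, feature dimensions, and the use of global temporal attention in place of nearby-frame attention are illustrative assumptions, not the authors' released VQLoC implementation.

# Conceptual sketch only; architecture details are assumptions, not VQLoC's code.
import torch
import torch.nn as nn

class SingleStageVQLSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Query-to-frame correspondence: frame tokens attend to query-crop tokens.
        self.query_to_frame = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Frame-to-frame correspondence: temporal attention over frames
        # (global here for simplicity; the paper restricts this to nearby frames).
        self.frame_to_frame = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-frame heads: bounding box (cx, cy, w, h) and occurrence probability.
        self.box_head = nn.Linear(dim, 4)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor, query_feats: torch.Tensor):
        # frame_feats: (T, N, D) patch tokens for T frames; query_feats: (M, D).
        T, N, D = frame_feats.shape
        q = query_feats.unsqueeze(0).expand(T, -1, -1)            # (T, M, D)
        fused, _ = self.query_to_frame(frame_feats, q, q)         # (T, N, D)
        frame_tokens = fused.mean(dim=1).unsqueeze(0)             # (1, T, D), one token per frame
        temporal, _ = self.frame_to_frame(frame_tokens, frame_tokens, frame_tokens)
        temporal = temporal.squeeze(0)                            # (T, D)
        boxes = self.box_head(temporal).sigmoid()                 # (T, 4) normalized boxes
        scores = self.score_head(temporal).sigmoid().squeeze(-1)  # (T,) occurrence probability
        return boxes, scores

if __name__ == "__main__":
    model = SingleStageVQLSketch()
    frames = torch.randn(16, 196, 256)   # 16 frames, 14x14 patch tokens each
    query = torch.randn(49, 256)         # visual query crop tokens
    boxes, scores = model(frames, query)
    print(boxes.shape, scores.shape)     # torch.Size([16, 4]) torch.Size([16])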
Cite
Text
Jiang et al. "Single-Stage Visual Query Localization in Egocentric Videos." Neural Information Processing Systems, 2023.
Markdown
[Jiang et al. "Single-Stage Visual Query Localization in Egocentric Videos." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/jiang2023neurips-singlestage/)
BibTeX
@inproceedings{jiang2023neurips-singlestage,
title = {{Single-Stage Visual Query Localization in Egocentric Videos}},
author = {Jiang, Hanwen and Ramakrishnan, Santhosh Kumar and Grauman, Kristen},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/jiang2023neurips-singlestage/}
}