SnAG: Scalable and Accurate Video Grounding

Abstract

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
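The cost argument for late fusion can be illustrated with a rough back-of-the-envelope sketch. This is not the authors' implementation: the function names and the abstract cost units (`video_cost`, `fusion_cost`) are hypothetical, and the model assumes that early fusion re-encodes the video jointly with every query, whereas late fusion encodes the video once and reuses that encoding for all queries.

```python
def early_fusion_cost(num_queries: int, video_cost: int, fusion_cost: int) -> int:
    """Assumed cost model: the video is processed once per query."""
    return num_queries * (video_cost + fusion_cost)


def late_fusion_cost(num_queries: int, video_cost: int, fusion_cost: int) -> int:
    """Assumed cost model: the video is encoded once and shared by all queries."""
    return video_cost + num_queries * fusion_cost


if __name__ == "__main__":
    # Hypothetical units: encoding a long video costs 10, fusing one query costs 1.
    for n in (1, 10, 100):
        print(n, early_fusion_cost(n, 10, 1), late_fusion_cost(n, 10, 1))
```

Under these assumptions the two schemes cost the same for a single query, and the gap widens as the number of queries per video grows, which is the regime the abstract targets.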

Cite

Text

Mu et al. "SnAG: Scalable and Accurate Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01791

Markdown

[Mu et al. "SnAG: Scalable and Accurate Video Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/mu2024cvpr-snag/) doi:10.1109/CVPR52733.2024.01791

BibTeX

@inproceedings{mu2024cvpr-snag,
  title     = {{SnAG: Scalable and Accurate Video Grounding}},
  author    = {Mu, Fangzhou and Mo, Sicheng and Li, Yin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {18930-18940},
  doi       = {10.1109/CVPR52733.2024.01791},
  url       = {https://mlanthology.org/cvpr/2024/mu2024cvpr-snag/}
}