CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding

Abstract

This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video given a text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two respects: i) less precise prediction of event temporal boundaries, and ii) inconsistent object predictions for the same event across adjacent frames. To address these issues, we propose Comprehensive Space-Time entAnglement (CoSTA), a framework that densely entangles space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage a Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, which improves its ability to capture object-event relationships. We conduct extensive experiments on the challenging HC-STVG and VidSTG benchmarks, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.
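To make the task output concrete, the following is a minimal sketch of what a spatio-temporal tube could look like as a data structure: a contiguous frame range (the temporal boundaries) with one bounding box per frame (the spatial localization). It illustrates the task definition only and is not the CoSTA model. The names `Tube`, `Box`, and `temporal_iou` are hypothetical, and the temporal overlap measure is a generic one chosen for illustration, not a metric taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame over a contiguous frame range."""

    start: int              # first frame index of the event (inclusive)
    end: int                # last frame index of the event (inclusive)
    boxes: Dict[int, Box]   # frame index -> box for that frame

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not exceed end")
        missing = [t for t in range(self.start, self.end + 1) if t not in self.boxes]
        if missing:
            raise ValueError(f"no box for frames {missing}")


def temporal_iou(a: Tube, b: Tube) -> float:
    """Intersection over union of two tubes' frame ranges (illustrative only)."""
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


def make_tube(start: int, end: int, box: Box) -> Tube:
    """Helper for the sketch: a tube whose box is the same in every frame."""
    return Tube(start, end, {t: box for t in range(start, end + 1)})
```

For example, `temporal_iou(make_tube(2, 5, (0, 0, 10, 10)), make_tube(4, 7, (0, 0, 10, 10)))` compares a tube covering frames 2 to 5 with one covering frames 4 to 7; they share two frames out of six in total.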

Cite

Text

Liang et al. "CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I4.28118

Markdown

[Liang et al. "CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/liang2024aaai-costa/) doi:10.1609/AAAI.V38I4.28118

BibTeX

@inproceedings{liang2024aaai-costa,
  title     = {{CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding}},
  author    = {Liang, Yaoyuan and Liang, Xiao and Tang, Yansong and Yang, Zhao and Li, Ziran and Wang, Jingang and Ding, Wenbo and Huang, Shao-Lun},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {3324-3332},
  doi       = {10.1609/AAAI.V38I4.28118},
  url       = {https://mlanthology.org/aaai/2024/liang2024aaai-costa/}
}