Streaming Detection of Queried Event Start

Abstract

Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods in NLP and for video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures on both short-clip and untrimmed video settings.
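
To make the streaming setup concrete, below is a minimal sketch of how a per-frame, query-conditioned detector might be evaluated in an SDQES-style loop. The model interface (`init_state`, `step`), the fixed score threshold, and the latency definition are illustrative assumptions for exposition, not the benchmark's actual implementation or metrics.

```python
# Minimal sketch of a streaming evaluation loop for SDQES-style detection.
# The model API, threshold, and latency definition below are illustrative
# assumptions, not the paper's benchmark code.

from typing import Iterable, Optional


def stream_detect(model, query: str, frames: Iterable, threshold: float = 0.5) -> Optional[int]:
    """Process frames one at a time and return the index at which the model
    first declares that the queried event has started (or None if it never fires)."""
    state = model.init_state(query)              # hypothetical: encode the text query once
    for t, frame in enumerate(frames):
        state, score = model.step(state, frame)  # hypothetical: online per-frame update
        if score >= threshold:                   # fire as soon as confidence crosses the threshold
            return t
    return None


def detection_latency(fired_at: Optional[int], event_start: int, fps: float = 30.0) -> Optional[float]:
    """Latency in seconds between the true event start and the model's firing frame.
    Firing before the event start, or never firing, is treated as a miss here."""
    if fired_at is None or fired_at < event_start:
        return None
    return (fired_at - event_start) / fps
```

In the untrimmed setting, a loop like this would run over long egocentric videos, with the threshold trading off early firing against false alarms; the paper's task-specific metrics score accuracy and latency jointly rather than with a single fixed threshold.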

Cite

Text

Eyzaguirre et al. "Streaming Detection of Queried Event Start." Neural Information Processing Systems, 2024. doi:10.52202/079017-3194

Markdown

[Eyzaguirre et al. "Streaming Detection of Queried Event Start." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/eyzaguirre2024neurips-streaming/) doi:10.52202/079017-3194

BibTeX

@inproceedings{eyzaguirre2024neurips-streaming,
  title     = {{Streaming Detection of Queried Event Start}},
  author    = {Eyzaguirre, Cristóbal and Tang, Eric and Buch, Shyamal and Gaidon, Adrien and Wu, Jiajun and Niebles, Juan Carlos},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3194},
  url       = {https://mlanthology.org/neurips/2024/eyzaguirre2024neurips-streaming/}
}