Commonsense for Zero-Shot Natural Language Video Localization

Holla, Meghana; Lourentzou, Ismini

doi:10.1609/AAAI.V38I3.27989

AAAI 2024 pp. 2166-2174

Abstract

Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
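The abstract reports results as recall at various thresholds and as mIoU. As a rough illustration only, the sketch below computes these quantities the way they are commonly defined for temporal localization: the intersection-over-union of a predicted and a ground-truth time span, the fraction of queries whose IoU meets a threshold, and the mean IoU over all queries. The function names and the (start, end) tuple format are assumptions made for this example and are not taken from the paper or its code.

```python
from typing import List, Tuple

# A video segment as (start_seconds, end_seconds); this format is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection-over-union of two time spans on the same video."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def _pairwise_ious(preds: List[Segment], golds: List[Segment]) -> List[float]:
    """IoU for each (prediction, ground truth) pair, one pair per query."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equally long")
    return [temporal_iou(p, g) for p, g in zip(preds, golds)]


def recall_at_iou(preds: List[Segment], golds: List[Segment], threshold: float) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    ious = _pairwise_ious(preds, golds)
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(preds: List[Segment], golds: List[Segment]) -> float:
    """Average IoU over all queries (mIoU)."""
    ious = _pairwise_ious(preds, golds)
    return sum(ious) / len(ious)
```

For instance, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so it counts as a hit at a threshold of 0.3 but not at 0.5.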

Cite

Text

Holla and Lourentzou. "Commonsense for Zero-Shot Natural Language Video Localization." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I3.27989

Markdown

[Holla and Lourentzou. "Commonsense for Zero-Shot Natural Language Video Localization." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/holla2024aaai-commonsense/) doi:10.1609/AAAI.V38I3.27989

BibTeX

@inproceedings{holla2024aaai-commonsense,
  title     = {{Commonsense for Zero-Shot Natural Language Video Localization}},
  author    = {Holla, Meghana and Lourentzou, Ismini},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {2166-2174},
  doi       = {10.1609/AAAI.V38I3.27989},
  url       = {https://mlanthology.org/aaai/2024/holla2024aaai-commonsense/}
}