Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection
Abstract
Zero-Shot Temporal Action Detection (ZSTAD) aims to classify and localize action segments in untrimmed videos for unseen action categories. Most existing ZSTAD methods utilize a foreground-based approach, which limits the integration of text and visual features because it relies on pre-extracted proposals. In this paper, we introduce a cross-modal ZSTAD baseline with mutual cross-attention, integrating both text and visual information throughout the detection process. Our simple approach results in superior performance compared to previous methods. Despite this improvement, we further identify a common-action bias issue: the cross-modal baseline over-focuses on common sub-actions because it lacks the ability to discriminate text-related visual parts. To address this issue, we propose Text-infused attention and Foreground-aware Action Detection (Ti-FAD), which enhances the ability to focus on text-related sub-actions and distinguish relevant action segments from the background. Our extensive experiments demonstrate that Ti-FAD outperforms the state-of-the-art methods on ZSTAD benchmarks by a large margin: 41.2% (+11.0%) on THUMOS14 and 32.0% (+5.4%) on ActivityNet v1.3. Code is available at: https://github.com/YearangLee/Ti-FAD.
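The mutual cross-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `mutual_cross_attention`), the feature shapes, the residual connections, and the omission of learned query/key/value projections are all assumptions made for illustration. It only shows the general idea of video features attending to text features and text features attending to video features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention where one modality queries the other.
    # Learned projections are omitted (assumption, for brevity).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def mutual_cross_attention(video, text):
    # Hypothetical helper: each modality attends to the other,
    # and the attended context is added back residually.
    video_ctx, v2t = cross_attention(video, text)   # video queries text
    text_ctx, t2v = cross_attention(text, video)    # text queries video
    return video + video_ctx, text + text_ctx, v2t, t2v

# Toy inputs: 8 video snippets and 3 class-name embeddings, 16 dims each.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
text = rng.normal(size=(3, 16))
video_out, text_out, v2t, t2v = mutual_cross_attention(video, text)
```

Here `v2t` holds one attention distribution over the text embeddings per video snippet, and `t2v` holds one distribution over the snippets per text embedding.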
Cite
Text
Lee et al. "Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection." Neural Information Processing Systems, 2024. doi:10.52202/079017-0316
Markdown
[Lee et al. "Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/lee2024neurips-textinfused/) doi:10.52202/079017-0316
BibTeX
@inproceedings{lee2024neurips-textinfused,
title = {{Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection}},
author = {Lee, Yearang and Kim, Ho-Joong and Lee, Seong-Whan},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0316},
url = {https://mlanthology.org/neurips/2024/lee2024neurips-textinfused/}
}