Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Abstract

The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.
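
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The names `Video`, `pseudo_label`, `self_train`, and `toy_train_fn`, the confidence threshold `min_conf`, and the choice to train stage 2 on pseudo-labels only are all assumptions made for illustration; the abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (start_sec, end_sec, confidence); class-agnostic, so no action label is attached.
Segment = Tuple[float, float, float]
Localizer = Callable[[str], List[Segment]]


@dataclass
class Video:
    video_id: str
    segments: List[Segment]  # human labels or pseudo-labels


def pseudo_label(localizer: Localizer, video_ids: List[str],
                 min_conf: float = 0.5) -> List[Video]:
    """Run the localizer on unlabeled videos and keep confident segments.

    The threshold `min_conf` is a hypothetical filtering rule, not taken
    from the paper.
    """
    labeled = []
    for vid in video_ids:
        kept = [seg for seg in localizer(vid) if seg[2] >= min_conf]
        if kept:
            labeled.append(Video(vid, kept))
    return labeled


def self_train(train_fn: Callable[[List[Video]], Localizer],
               labeled: List[Video], unlabeled_ids: List[str],
               min_conf: float = 0.5):
    # Stage 1: train on human-labeled data, then pseudo-label unlabeled videos.
    teacher = train_fn(labeled)
    pseudo = pseudo_label(teacher, unlabeled_ids, min_conf)
    # Stage 2: train the localizer on the pseudo-labeled set
    # (assumption: pseudo-labels only, no mixing with human labels).
    student = train_fn(pseudo)
    return student, pseudo


def toy_train_fn(dataset: List[Video]) -> Localizer:
    """Hypothetical stand-in for real training: 'learns' the mean segment length."""
    lengths = [end - start for v in dataset for (start, end, _) in v.segments]
    mean_len = sum(lengths) / len(lengths)

    def localizer(video_id: str) -> List[Segment]:
        # One confident proposal and one low-confidence proposal per video.
        return [(0.0, mean_len, 0.9), (mean_len, 2 * mean_len, 0.3)]

    return localizer


labeled = [Video("a", [(0.0, 2.0, 1.0)]), Video("b", [(1.0, 5.0, 1.0)])]
student, pseudo = self_train(toy_train_fn, labeled, ["u1", "u2"])
print(len(pseudo), pseudo[0].segments)
```

In a real setting, `toy_train_fn` would be replaced by actual training of a class-agnostic localizer, and `unlabeled_ids` by a large pool of web videos.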

Cite

Text

Hyun et al. "Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Hyun et al. "Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/hyun2025wacv-exploring/)

BibTeX

@inproceedings{hyun2025wacv-exploring,
  title     = {{Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization}},
  author    = {Hyun, Jeongseok and Han, Su Ho and Kang, Hyolim and Lee, Joon-Young and Kim, Seon Joo},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {9388--9397},
  url       = {https://mlanthology.org/wacv/2025/hyun2025wacv-exploring/}
}