Is Temporal Prompting All We Need for Limited Labeled Action Recognition?

Abstract

Video understanding has improved remarkably in recent years, driven largely by a dependence on large-scale labeled datasets. Recent advances in vision-language models have shown remarkable generalization on zero-shot tasks, helping to reduce this dependence on labeled data. Adaptations for video, however, are computationally intensive, struggle with temporal modeling, and typically modify the architecture of the vision-language model to accommodate video data. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture, thereby preserving its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, where it outperforms existing approaches with fewer parameters and lower computational cost. In particular, TP-CLIP uses just 1/3 of the GFLOPs and 1/28 of the tunable parameters of the recent state of the art, yet still outperforms it by up to 15.8% depending on the task and dataset.
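The core idea, injecting temporal information through a few learnable prompt tokens while keeping the CLIP encoder itself frozen, can be sketched roughly as follows. Every name, shape, and the motion-based conditioning below are illustrative assumptions for a minimal sketch, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, D = 8, 49, 512      # frames, patch tokens per frame, embedding dim
P = 4                     # number of learnable temporal prompt tokens

frame_tokens = rng.standard_normal((T, N, D))   # per-frame patch tokens
temporal_prompts = rng.standard_normal((P, D))  # the only tuned parameters

def add_temporal_context(prompts, tokens):
    """Condition the shared prompts on a crude temporal cue:
    the mean-pooled difference between consecutive frames."""
    conditioned = []
    for t in range(tokens.shape[0]):
        prev = tokens[t - 1] if t > 0 else tokens[t]
        motion = (tokens[t] - prev).mean(axis=0)   # (D,) temporal signal
        conditioned.append(prompts + motion)       # (P, D) per frame
    return np.stack(conditioned)                   # (T, P, D)

# Prepend the conditioned prompts to each frame's token sequence;
# the concatenated sequence would then pass through the frozen encoder.
prompted = add_temporal_context(temporal_prompts, frame_tokens)
encoder_input = np.concatenate([prompted, frame_tokens], axis=1)
print(encoder_input.shape)  # (8, 53, 512): P + N tokens per frame
```

Because only the prompt tokens carry gradients, the tunable-parameter count stays tiny relative to the frozen backbone, which is consistent with the efficiency claims in the abstract.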

Cite

Text

Gowda et al. "Is Temporal Prompting All We Need for Limited Labeled Action Recognition?" IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Gowda et al. "Is Temporal Prompting All We Need for Limited Labeled Action Recognition?" IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/gowda2025cvprw-temporal/)

BibTeX

@inproceedings{gowda2025cvprw-temporal,
  title     = {{Is Temporal Prompting All We Need for Limited Labeled Action Recognition?}},
  author    = {Gowda, Shreyank N. and Gao, Boyan and Gu, Xiao and Jin, Xiao-Bo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {682--692},
  url       = {https://mlanthology.org/cvprw/2025/gowda2025cvprw-temporal/}
}