Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation

Abstract

Video understanding typically requires fine-tuning a large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation method for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation of each video frame/text feature using basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
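
As a rough illustration of the idea described above (not the paper's exact implementation), the sketch below approximates each frame/text feature with a sparse, top-k weighted combination of a small shared prompt basis and uses the selected basis vectors to synthesize a per-token prompt. The class name `BasisPromptSynthesizer`, the basis size, and the top-k value are hypothetical choices for this sketch.

```python
import torch
import torch.nn.functional as F

class BasisPromptSynthesizer(torch.nn.Module):
    """Hypothetical sketch: sparse approximation over a shared prompt basis."""

    def __init__(self, dim: int, num_basis: int = 16, top_k: int = 4):
        super().__init__()
        # Learnable basis prompts, shared across frames and modalities.
        self.basis = torch.nn.Parameter(torch.randn(num_basis, dim) * 0.02)
        self.top_k = top_k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, dim) frame or text features.
        # Score each feature against every basis prompt (cosine similarity).
        scores = F.normalize(feats, dim=-1) @ F.normalize(self.basis, dim=-1).T
        # Keep only the top-k basis prompts per feature (local sparse approximation).
        topk_vals, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = topk_vals.softmax(dim=-1)        # (B, T, k)
        selected = self.basis[topk_idx]            # (B, T, k, dim)
        # Synthesize one prompt per feature as a weighted sum of the selected basis.
        prompts = (weights.unsqueeze(-1) * selected).sum(dim=-2)
        return prompts                             # (B, T, dim)

# Usage: prepend synthesized prompts to the (frozen) backbone's input tokens.
frames = torch.randn(2, 8, 512)                    # e.g., 8 frame features of dim 512
prompts = BasisPromptSynthesizer(512)(frames)
tokens_with_prompts = torch.cat([prompts, frames], dim=1)
```

Under this reading, only the shared basis (plus any lightweight projections) is trained while the backbone stays frozen, which is consistent with the abstract's claim of adapting with a very small fraction of learnable parameters.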

Cite

Text

Wu et al. "Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Wu et al. "Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/wu2025wacv-egovpa/)

BibTeX

@inproceedings{wu2025wacv-egovpa,
  title     = {{Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation}},
  author    = {Wu, Tz-Ying and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {9240--9250},
  url       = {https://mlanthology.org/wacv/2025/wu2025wacv-egovpa/}
}