Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation
Abstract
Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
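To make the idea in the abstract concrete, the sketch below illustrates one plausible reading of the mechanism: each frame/text feature selects its top-k closest prompts from a small shared basis (a local sparse approximation), and the synthesized prompts are a weighted sum of the selected basis prompts. All class, parameter, and variable names here (e.g. PromptSynthesizer, num_basis, k) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class PromptSynthesizer(torch.nn.Module):
    """Minimal sketch of prompt synthesis from a shared prompt basis."""

    def __init__(self, num_basis: int = 16, dim: int = 512, k: int = 4):
        super().__init__()
        # The basis prompts are shared across frames and across the
        # video/text modalities, which keeps the learnable parameter
        # count small.
        self.basis = torch.nn.Parameter(torch.randn(num_basis, dim) * 0.02)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, dim), e.g. per-frame video features
        # or text token features.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.basis, dim=-1).T
        topk_sim, topk_idx = sim.topk(self.k, dim=-1)   # local sparse selection
        weights = topk_sim.softmax(dim=-1)              # (batch, tokens, k)
        selected = self.basis[topk_idx]                 # (batch, tokens, k, dim)
        # Synthesized prompts: sparse weighted sum of the selected basis prompts.
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

# Example: synthesize prompts for 8 frames of one video clip.
prompts = PromptSynthesizer()(torch.randn(1, 8, 512))
print(prompts.shape)  # torch.Size([1, 8, 512])
```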
Cite
Text
Wu et al. "Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation." Winter Conference on Applications of Computer Vision, 2025.

Markdown
[Wu et al. "Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/wu2025wacv-egovpa/)

BibTeX
@inproceedings{wu2025wacv-egovpa,
title = {{Ego-VPA: Egocentric Video Understanding with Parameter-Efficient Adaptation}},
author = {Wu, Tz-Ying and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {9240-9250},
url = {https://mlanthology.org/wacv/2025/wu2025wacv-egovpa/}
}