Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization
Abstract
Audio-Visual Learning (AVL) aims at perception using both the audio and visual modalities. Like unimodal tasks, AVL suffers from data insufficiency in many applications. At the same time, AVL often needs to learn continually over time rather than acquire all knowledge at once. Considering these two perspectives, our work focuses on benchmarking the unexplored task of Few-Shot Audio-Visual Class-Incremental Learning (FS-AVCIL), i.e., continually perceiving novel categories described by a limited number of labeled examples with audio and visual modalities. First, we provide the detailed task configuration together with a thorough analysis of the challenges in FS-AVCIL: (1) how to efficiently learn and fuse multimodal information with limited labeled examples; and (2) how to alleviate catastrophic forgetting of cross-modal semantic correlations with limited data. Then, we propose an efficient framework based on the Vision Transformer to solve FS-AVCIL. This framework contains two parts: temporal-residual prompting for the audio-visual synergy adapter, and temporal prompt regularization. Specifically, temporal-residual prompting is incorporated into the audio-visual adapter to efficiently fine-tune the pre-trained foundation model with limited data and to capture audio-visual correlation by learning temporal-relevant prompts. In addition, we regularize the temporal-relevant prompts to retain previous knowledge by exploiting temporal knowledge from various perspectives. The framework is validated on audio-visual classification tasks under the FS-AVCIL scenario, and extensive experiments demonstrate its superior performance.
Cite
Text
Cui et al. "Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I15.33770
Markdown
[Cui et al. "Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/cui2025aaai-few/) doi:10.1609/AAAI.V39I15.33770
BibTeX
@inproceedings{cui2025aaai-few,
title = {{Few-Shot Audio-Visual Class-Incremental Learning with Temporal Prompting and Regularization}},
author = {Cui, Yawen and Liu, Li and Yu, Zitong and Huang, Guanjie and Hong, Xiaopeng},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {16118-16126},
doi = {10.1609/AAAI.V39I15.33770},
url = {https://mlanthology.org/aaai/2025/cui2025aaai-few/}
}