Audio-Visual Class-Incremental Learning

Abstract

In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve the semantic similarity between audio and visual features as the number of incremental steps grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates two components. The first is a Dual-Audio-Visual Similarity Constraint (D-AVSC), which maintains both instance-aware and class-aware semantic similarity between the audio and visual modalities. The second is Visual Attention Distillation (VAD), which retains the previously learned ability to attend to visual content under audio guidance. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI), based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in this setting. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
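
The abstract names the D-AVSC and VAD objectives but does not give their formulas. The sketch below illustrates one plausible reading under stated assumptions. It uses an InfoNCE-style contrastive loss for the instance-aware constraint, a supervised-contrastive loss for the class-aware constraint, and a KL divergence between the previous and current models' audio-guided attention maps for VAD. All function names, the temperature `tau`, and the random stand-in features are hypothetical. This is not the authors' implementation; see the linked repository for that.

```python
import numpy as np

# Hypothetical sketch: these formulations are illustrative assumptions,
# not the losses defined in the AV-CIL paper or its released code.

def l2_normalize(x, eps=1e-8):
    # Row-wise L2 normalization so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def instance_aware_loss(audio, visual, tau=0.1):
    # Assumed InfoNCE-style objective: the audio and visual features of the same
    # clip (same row) are positives; the other clips in the batch are negatives.
    logits = l2_normalize(audio) @ l2_normalize(visual).T / tau
    return -np.mean(np.diag(log_softmax(logits)))

def class_aware_loss(audio, visual, labels, tau=0.1):
    # Assumed supervised-contrastive variant: every visual feature whose class
    # label matches the audio clip's label is treated as a positive.
    log_prob = log_softmax(l2_normalize(audio) @ l2_normalize(visual).T / tau)
    pos = (labels[:, None] == labels[None, :]).astype(float)
    return -np.mean((log_prob * pos).sum(axis=1) / pos.sum(axis=1))

def audio_guided_attention(audio, regions):
    # Softmax over visual regions of the scaled dot product between each clip's
    # audio feature (batch, dim) and its region features (batch, regions, dim).
    scores = np.einsum("bd,brd->br", audio, regions) / np.sqrt(audio.shape[-1])
    return np.exp(log_softmax(scores))

def attention_distillation_loss(old_attn, new_attn, eps=1e-8):
    # Assumed mean KL(old || new): penalizes the current model for drifting away
    # from the attention maps produced by the frozen previous-step model.
    kl = old_attn * (np.log(old_attn + eps) - np.log(new_attn + eps))
    return np.mean(kl.sum(axis=1))

# Usage with random stand-in features (no real model or dataset involved).
rng = np.random.default_rng(0)
audio, visual = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
labels = np.array([0, 0, 1, 1])
regions = rng.normal(size=(4, 5, 8))
old_attn = audio_guided_attention(audio, regions)                 # stand-in: old model
new_attn = audio_guided_attention(audio + 0.5 * visual, regions)  # stand-in: new model
total = (instance_aware_loss(audio, visual)
         + class_aware_loss(audio, visual, labels)
         + attention_distillation_loss(old_attn, new_attn))
print(f"total loss: {total:.4f}")
```

In an actual incremental run, `old_attn` would presumably come from a frozen copy of the model trained on earlier tasks, and these terms would be weighted against the classification loss. The equal weighting here is only for illustration.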

Cite

Text

Pian et al. "Audio-Visual Class-Incremental Learning." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00717

BibTeX

@inproceedings{pian2023iccv-audiovisual,
  title     = {{Audio-Visual Class-Incremental Learning}},
  author    = {Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {7799-7811},
  doi       = {10.1109/ICCV51070.2023.00717},
  url       = {https://mlanthology.org/iccv/2023/pian2023iccv-audiovisual/}
}