AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Abstract
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both the audio and visual modalities together, or either one alone. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the LRS3 dataset while maintaining practical ASR and AVSR performance.
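The training scheme the abstract describes can be sketched as a loop that mixes supervised updates on labeled clips with updates on unlabeled clips whose transcripts are produced, on the fly, by the model currently being trained. The sketch below is illustrative only, assuming toy stand-ins (`ToyModel`, `cpl_train`) rather than the paper's actual architecture or training recipe:

```python
import random

class ToyModel:
    """Stand-in for an audio-visual speech recognizer (hypothetical)."""
    def __init__(self):
        self.steps_trained = 0

    def transcribe(self, clip):
        # Placeholder inference: a real model would decode text from
        # audio-visual features. Here we return a dummy transcript.
        return f"pseudo-label-for-{clip}"

    def train_step(self, clip, transcript):
        # Placeholder gradient update on one (clip, transcript) pair.
        self.steps_trained += 1

def cpl_train(model, labeled, unlabeled, steps, labeled_frac=0.5):
    """Continuous pseudo-labeling sketch.

    Pseudo-labels are regenerated by the *current* model each time an
    unlabeled clip is sampled, so they improve as training progresses;
    no external ASR model is needed to produce them.
    """
    for _ in range(steps):
        if labeled and random.random() < labeled_frac:
            clip, transcript = random.choice(labeled)      # supervised update
        else:
            clip = random.choice(unlabeled)
            transcript = model.transcribe(clip)            # pseudo-label, regenerated now
        model.train_step(clip, transcript)
    return model

model = cpl_train(ToyModel(),
                  labeled=[("clip0", "hello world")],
                  unlabeled=["clip1", "clip2"],
                  steps=10)
```

The key design point mirrored here is that a single model serves both roles (learner and pseudo-labeler), so label quality tracks model quality throughout training rather than being frozen at the start.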
Cite
Text
Rouditchenko et al. "AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-93806-1_18
Markdown
[Rouditchenko et al. "AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/rouditchenko2024eccvw-avcpl/) doi:10.1007/978-3-031-93806-1_18
BibTeX
@inproceedings{rouditchenko2024eccvw-avcpl,
title = {{AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition}},
author = {Rouditchenko, Andrew and Collobert, Ronan and Likhomanenko, Tatiana},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
  pages = {238--249},
doi = {10.1007/978-3-031-93806-1_18},
url = {https://mlanthology.org/eccvw/2024/rouditchenko2024eccvw-avcpl/}
}