Cross-Modal Distillation for Speaker Recognition
Abstract
Speaker recognition has made great progress recently, but further improving its performance through traditional means, collecting more data and designing new neural networks, is neither easy nor efficient. To address the fundamental challenge of speech data, namely its low information density, multimodal learning can introduce richer and more discriminative information for identity recognition. Specifically, since face images are more discriminative than speech for identity recognition, we conduct multimodal learning by introducing a face recognition model (teacher) that transfers discriminative knowledge to a speaker recognition model (student) during training. However, this knowledge transfer via distillation is not trivial, because the large domain gap between face and speech can easily lead to overfitting. In this work, we introduce a multimodal learning framework, VGSR (Vision-Guided Speaker Recognition). Specifically, we propose an MKD (Margin-based Knowledge Distillation) strategy for cross-modal distillation that introduces a loose constraint to align the teacher and student, greatly reducing overfitting. Our MKD strategy can easily be adapted to various existing knowledge distillation methods. In addition, we propose a QAW (Quality-based Adaptive Weights) module that weights input samples by quantified data quality, leading to more robust model training. Experimental results on the VoxCeleb1 and CN-Celeb datasets show that our proposed strategies effectively improve speaker recognition accuracy by a margin of 10% ∼ 15%, and that our methods are robust to different types of noise.
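To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch of what a margin-based distillation loss combined with quality-based sample weighting could look like. It is not the paper's exact formulation: the hinge-on-cosine-distance form, the softmax quality weighting, and the names `margin_distillation_loss` and `quality_weighted_mean` are illustrative assumptions. The key idea it demonstrates is the loose constraint, where the student (speech) embedding only needs to come within a margin of the teacher (face) embedding rather than match it exactly, and the down-weighting of low-quality samples.

```python
import torch
import torch.nn.functional as F


def margin_distillation_loss(student_emb, teacher_emb, margin=0.2):
    """Per-sample hinge on cosine distance: only penalize the student when
    its embedding is farther than `margin` from the teacher's, i.e. a loose
    alignment constraint instead of exact matching (assumed form of MKD)."""
    cos_dist = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)
    return torch.clamp(cos_dist - margin, min=0.0)


def quality_weighted_mean(per_sample_loss, quality_scores):
    """Weight per-sample losses by normalized quality scores so that
    higher-quality inputs dominate the batch loss (assumed form of QAW)."""
    weights = torch.softmax(quality_scores, dim=0)
    return (weights * per_sample_loss).sum()


# Toy usage with random 256-d embeddings and made-up quality scores.
student = torch.randn(4, 256)                 # speech-model (student) embeddings
teacher = torch.randn(4, 256)                 # face-model (frozen teacher) embeddings
quality = torch.tensor([0.9, 0.5, 0.7, 0.2])  # hypothetical per-sample quality estimates

loss = quality_weighted_mean(margin_distillation_loss(student, teacher), quality)
print(loss.item())
```

In practice this distillation term would be added to the usual speaker classification loss; the margin controls how tightly the speech embedding space is pulled toward the face embedding space.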
Cite
Text
Jin et al. "Cross-Modal Distillation for Speaker Recognition." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I11.26525
Markdown
[Jin et al. "Cross-Modal Distillation for Speaker Recognition." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/jin2023aaai-cross/) doi:10.1609/AAAI.V37I11.26525
BibTeX
@inproceedings{jin2023aaai-cross,
title = {{Cross-Modal Distillation for Speaker Recognition}},
author = {Jin, Yufeng and Hu, Guosheng and Chen, Haonan and Miao, Duoqian and Hu, Liang and Zhao, Cairong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2023},
pages = {12977--12985},
doi = {10.1609/AAAI.V37I11.26525},
url = {https://mlanthology.org/aaai/2023/jin2023aaai-cross/}
}