Learning from the Master: Distilling Cross-Modal Advanced Knowledge for Lip Reading
Abstract
Lip reading aims to predict spoken sentences from silent lip videos. Because this vision task usually performs worse than its counterpart, speech recognition, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data can lead to a learning ambiguity and thus limit the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading that addresses two issues: 1) the teacher should understand bi-modal knowledge so that it can bridge the inherent cross-modal gap; 2) the teacher should adjust its teaching content adaptively as the student evolves. To these ends, instead of a pretrained teacher, we introduce a trainable "master" network that ingests both audio signals and silent lip videos. The master produces logits from three modalities of features: the audio modality, the video modality, and their combination. To further provide an interactive strategy for fusing this knowledge organically, we regularize the master with task-specific feedback from the student, in which the requirements of the student are implicitly embedded. Meanwhile, we incorporate a pair of "tutor" networks into our system as guidance for flexibly emphasizing the fruitful knowledge. In addition, we incorporate a curriculum learning design to ensure better convergence. Extensive experiments demonstrate that the proposed network outperforms state-of-the-art methods on several benchmarks, in both word-level and sentence-level scenarios.
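To make the idea of distilling from three sets of master logits concrete, the following is a minimal sketch of a generic temperature-scaled distillation loss, not the authors' implementation. It assumes a standard KL-divergence objective and fixed per-modality weights; the function names, the dictionary keys (`audio`, `video`, `audio_visual`), the weight values, and the temperature are all hypothetical. In the paper, the emphasis on each source is guided by the tutor networks rather than fixed by hand.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_source_distillation_loss(student_logits, master_logits, weights,
                                   temperature=2.0):
    """Weighted sum of KL terms between each master output and the student.

    master_logits and weights are dicts sharing the same (hypothetical) keys,
    e.g. 'audio', 'video', 'audio_visual'. The weights are assumed to be
    given; here they are plain numbers chosen for illustration.
    """
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for name, logits in master_logits.items():
        master_probs = softmax(logits, temperature)
        loss += weights[name] * kl_divergence(master_probs, student_probs)
    # Conventional temperature**2 factor used in soft-target distillation.
    return loss * temperature ** 2


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]
    master = {
        "audio": [2.0, 0.1, -1.0],
        "video": [1.2, 0.4, -0.3],
        "audio_visual": [2.5, 0.0, -1.5],
    }
    weights = {"audio": 0.3, "video": 0.3, "audio_visual": 0.4}
    print(multi_source_distillation_loss(student, master, weights))
```

The loss is zero when every master output matches the student and grows as they diverge, so larger weights pull the student harder toward the corresponding modality.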
Cite
Text
Ren et al. "Learning from the Master: Distilling Cross-Modal Advanced Knowledge for Lip Reading." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01312
Markdown
[Ren et al. "Learning from the Master: Distilling Cross-Modal Advanced Knowledge for Lip Reading." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/ren2021cvpr-learning/) doi:10.1109/CVPR46437.2021.01312
BibTeX
@inproceedings{ren2021cvpr-learning,
title = {{Learning from the Master: Distilling Cross-Modal Advanced Knowledge for Lip Reading}},
author = {Ren, Sucheng and Du, Yong and Lv, Jianming and Han, Guoqiang and He, Shengfeng},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {13325-13333},
doi = {10.1109/CVPR46437.2021.01312},
url = {https://mlanthology.org/cvpr/2021/ren2021cvpr-learning/}
}