Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration
Abstract
We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling, the Bayesian optimal method given conditional independence, which seems to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
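The fusion step named in the abstract, independent opinion pooling, amounts to multiplying the per-class posteriors produced by the acoustic and video networks and renormalizing, which is Bayes-optimal when the two observation streams are conditionally independent given the class. The sketch below illustrates the idea under an assumed uniform class prior; the function name and posterior values are hypothetical and not taken from the paper.

```python
import numpy as np

def independent_opinion_pool(p_acoustic, p_video):
    """Combine per-class posteriors from two classifiers by taking their
    elementwise product and renormalizing. This is the Bayes-optimal rule
    when the acoustic and video observations are conditionally independent
    given the class (a uniform class prior is assumed for simplicity)."""
    combined = np.asarray(p_acoustic) * np.asarray(p_video)
    return combined / combined.sum()

# Hypothetical posteriors over five utterance classes from each
# time delay neural network (values are illustrative only).
p_acoustic = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
p_video    = np.array([0.35, 0.10, 0.35, 0.10, 0.10])

print(independent_opinion_pool(p_acoustic, p_video))
# The product sharpens agreement between the two streams:
# class 0, favored by both networks, dominates the pooled estimate.
```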
Cite
Text
Wolff et al. "Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration." Neural Information Processing Systems, 1993.
Markdown
[Wolff et al. "Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration." Neural Information Processing Systems, 1993.](https://mlanthology.org/neurips/1993/wolff1993neurips-lipreading/)
BibTeX
@inproceedings{wolff1993neurips-lipreading,
  title = {{Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration}},
  author = {Wolff, Gregory J. and Prasad, K. Venkatesh and Stork, David G. and Hennecke, Marcus},
  booktitle = {Neural Information Processing Systems},
  year = {1993},
  pages = {1027-1034},
  url = {https://mlanthology.org/neurips/1993/wolff1993neurips-lipreading/}
}