Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration

Abstract

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling - the Bayesian optimal method given conditional independence, which seems to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
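
The abstract's "independent opinion pooling" combines the two networks' outputs by multiplying per-class posteriors, dividing by the class prior, and renormalizing, which is Bayes-optimal when the modalities are conditionally independent given the class. The sketch below is an illustration of that rule, not the authors' code; the function name, the uniform-prior default, and the example probabilities are assumptions made for the sketch.

```python
import numpy as np

def opinion_pool(p_acoustic, p_video, prior=None):
    """Combine per-class posteriors from two modalities by independent
    opinion pooling: P(c | a, v) is proportional to P(c | a) * P(c | v) / P(c),
    assuming the acoustic and video observations are conditionally
    independent given the class c."""
    p_acoustic = np.asarray(p_acoustic, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    if prior is None:
        # Assumed uniform class prior for this illustration.
        prior = np.full_like(p_acoustic, 1.0 / p_acoustic.size)
    joint = p_acoustic * p_video / prior
    return joint / joint.sum()  # renormalize to a proper distribution

# Hypothetical example for a five-utterance task: the acoustic network
# slightly prefers class 0, the video network prefers class 1.
p_a = [0.50, 0.20, 0.15, 0.10, 0.05]   # acoustic network posteriors
p_v = [0.30, 0.40, 0.10, 0.10, 0.10]   # video network posteriors
print(opinion_pool(p_a, p_v))
```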

Cite

Text

Wolff et al. "Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration." Neural Information Processing Systems, 1993.

Markdown

[Wolff et al. "Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration." Neural Information Processing Systems, 1993.](https://mlanthology.org/neurips/1993/wolff1993neurips-lipreading/)

BibTeX

@inproceedings{wolff1993neurips-lipreading,
  title     = {{Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration}},
  author    = {Wolff, Gregory J. and Prasad, K. Venkatesh and Stork, David G. and Hennecke, Marcus},
  booktitle = {Neural Information Processing Systems},
  year      = {1993},
  pages     = {1027-1034},
  url       = {https://mlanthology.org/neurips/1993/wolff1993neurips-lipreading/}
}