Audio-Visual Speech Fusion Using Coupled Hidden Markov Models

Abstract

The fusion of audio and visual speech is an instance of the general sensory fusion problem, which arises when multiple channels carry complementary information about different components of a system. In the case of audio-visual speech, the two modalities manifest two aspects of the same underlying speech production process. From an observer's view, the audio channel and the visual channel represent two interacting stochastic processes. We seek a framework that can model the two individual processes as well as their dynamic interactions. One interesting aspect of audio-visual speech is the inherent asynchrony between the audio and visual channels. Most early integration approaches to the fusion problem assume tight synchrony between the two channels. However, studies have shown that human perception of bimodal speech does not require rigid synchronization of the two modalities. Furthermore, humans appear to use audio-visual asynchronies as multimodal features. For example, it is well known that voice onset time is an important cue to the voicing feature in stop consonants. This information can be conveyed bimodally by the interval between seeing the stop release and hearing the vocal cord vibration. Therefore, a successful fusion scheme should not only tolerate asynchrony between the audio and visual cues, but also capture and exploit this bimodal feature.

Cite

Text

Chu and Huang. "Audio-Visual Speech Fusion Using Coupled Hidden Markov Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2007. doi:10.1109/CVPR.2007.383524

Markdown

[Chu and Huang. "Audio-Visual Speech Fusion Using Coupled Hidden Markov Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2007.](https://mlanthology.org/cvpr/2007/chu2007cvpr-audio/) doi:10.1109/CVPR.2007.383524

BibTeX

@inproceedings{chu2007cvpr-audio,
  title     = {{Audio-Visual Speech Fusion Using Coupled Hidden Markov Models}},
  author    = {Chu, Stephen M. and Huang, Thomas S.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2007},
  doi       = {10.1109/CVPR.2007.383524},
  url       = {https://mlanthology.org/cvpr/2007/chu2007cvpr-audio/}
}