Speaker Detection Using the Timing Structure of Lip Motion and Sound

Abstract

In this paper, we propose a novel approach to speaker detection that integrates audio-visual information using the cue of timing structure. We first extract feature sequences of lip motion and sound and segment each of them into temporal intervals. We then construct a cross-media timing-structure model of human speech by learning the temporal relations of overlapping intervals. Based on the learned model, we perform speaker detection by evaluating the timing structure of the observed video and audio. Our experimental results show the effectiveness of using the temporal relations of intervals for speaker detection.
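The abstract's pipeline — segment lip-motion and sound features into intervals, learn the temporal relations of overlapping intervals, then score observed data against the learned model — can be illustrated with a minimal sketch. This is not the paper's actual model: it assumes intervals are simple `(start, end)` tuples, uses a coarse Allen-style relation classification, and summarizes the timing structure as a relation histogram compared by histogram intersection; the names `timing_signature` and `similarity` are hypothetical.

```python
from collections import Counter

def relation(a, b):
    """Classify the temporal relation between intervals a=(start, end)
    and b=(start, end) with a coarse Allen-style scheme (a simplification;
    the paper's model captures richer timing structure)."""
    if a[1] <= b[0]:
        return "before"
    if b[1] <= a[0]:
        return "after"
    if a[0] == b[0] and a[1] == b[1]:
        return "equal"
    if a[0] <= b[0] and a[1] >= b[1]:
        return "contains"
    if b[0] <= a[0] and b[1] >= a[1]:
        return "during"
    return "overlaps"

def timing_signature(lip_intervals, sound_intervals):
    """Normalized histogram of relations over temporally overlapping
    lip/sound interval pairs -- a crude stand-in for the learned
    cross-media timing-structure model."""
    counts = Counter()
    for a in lip_intervals:
        for b in sound_intervals:
            if a[1] > b[0] and b[1] > a[0]:  # pair overlaps in time
                counts[relation(a, b)] += 1
    total = sum(counts.values()) or 1
    return {r: c / total for r, c in counts.items()}

def similarity(observed, model):
    """Score an observed signature against a learned one by
    histogram intersection; higher means more consistent timing."""
    keys = set(observed) | set(model)
    return sum(min(observed.get(k, 0.0), model.get(k, 0.0)) for k in keys)
```

In this toy setting, speaker detection would amount to computing `timing_signature` for each candidate face against the audio intervals and selecting the face whose signature scores highest under `similarity` with the learned model.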

Cite

Text

Horii et al. "Speaker Detection Using the Timing Structure of Lip Motion and Sound." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2008. doi:10.1109/CVPRW.2008.4563183

Markdown

[Horii et al. "Speaker Detection Using the Timing Structure of Lip Motion and Sound." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2008.](https://mlanthology.org/cvprw/2008/horii2008cvprw-speaker/) doi:10.1109/CVPRW.2008.4563183

BibTeX

@inproceedings{horii2008cvprw-speaker,
  title     = {{Speaker Detection Using the Timing Structure of Lip Motion and Sound}},
  author    = {Horii, Yu and Kawashima, Hiroaki and Matsuyama, Takashi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2008},
  pages     = {1--8},
  doi       = {10.1109/CVPRW.2008.4563183},
  url       = {https://mlanthology.org/cvprw/2008/horii2008cvprw-speaker/}
}