Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
Abstract
People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low-level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a nonparametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
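The abstract describes projecting both signals into a low-dimensional subspace and then modeling their joint distribution with a nonparametric density estimator, so that statistical dependence across modalities can be measured. The sketch below is not the authors' implementation; it is a minimal illustration of that second step, using a Gaussian kernel density estimator to compare the mutual information between a dependent "audio"/"video" pair (driven by a shared latent source) and an independent distractor. The bandwidth, sample sizes, and noise levels are arbitrary choices for the example.

```python
import numpy as np

def kde_logpdf(points, samples, bw):
    """Log-density of `points` under a Gaussian KDE built from `samples`."""
    diff = points[:, None, :] - samples[None, :, :]        # (n, m, d)
    sq = (diff ** 2).sum(axis=-1) / (2.0 * bw ** 2)        # (n, m)
    d = points.shape[1]
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bw ** 2)
    # log-mean-exp over the m kernel centres, max-shifted for stability
    mx = (-sq).max(axis=1, keepdims=True)
    return mx[:, 0] + np.log(np.exp(-sq - mx).mean(axis=1)) + log_norm

def mi_kde(a, v, bw=0.5):
    """Nonparametric mutual-information estimate between 1-D signals a, v:
    the sample mean of log p(a, v) - log p(a) - log p(v) under KDEs."""
    a = a.reshape(-1, 1)
    v = v.reshape(-1, 1)
    av = np.hstack([a, v])
    return float(np.mean(kde_logpdf(av, av, bw)
                         - kde_logpdf(a, a, bw)
                         - kde_logpdf(v, v, bw)))

rng = np.random.default_rng(0)
z = rng.normal(size=500)                    # shared latent cause
audio = z + 0.3 * rng.normal(size=500)      # 1-D "audio" projection
video = z + 0.3 * rng.normal(size=500)      # 1-D "video" projection
unrelated = rng.normal(size=500)            # independent distractor

mi_dep = mi_kde(audio, video)
mi_ind = mi_kde(audio, unrelated)
print(mi_dep, mi_ind)   # the dependent pair scores much higher
```

In the paper's setting this kind of dependence score, maximized over the projection directions, is what drives the cross-modal tasks: a video region whose projection shares high mutual information with the audio is a candidate for the speaking face.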
Cite

Text
Fisher III et al. "Learning Joint Statistical Models for Audio-Visual Fusion and Segregation." Neural Information Processing Systems, 2000.

Markdown
[Fisher III et al. "Learning Joint Statistical Models for Audio-Visual Fusion and Segregation." Neural Information Processing Systems, 2000.](https://mlanthology.org/neurips/2000/iii2000neurips-learning/)

BibTeX
@inproceedings{iii2000neurips-learning,
  title = {{Learning Joint Statistical Models for Audio-Visual Fusion and Segregation}},
  author = {Fisher, III, John W. and Darrell, Trevor and Freeman, William T. and Viola, Paul A.},
  booktitle = {Neural Information Processing Systems},
  year = {2000},
  pages = {772--778},
  url = {https://mlanthology.org/neurips/2000/iii2000neurips-learning/}
}