Self-Supervised Generation of Spatial Audio for 360° Video
Abstract
We introduce an approach to convert mono audio recorded by a 360° video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360° video viewing, but spatial audio microphones are still rare in current 360° video production. Our system consists of end-to-end trainable neural networks that separate individual sound sources and localize them on the viewing sphere, conditioned on multi-modal analysis of the audio and 360° video frames. We introduce several datasets, including one we filmed ourselves and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial localization of sounds based only on a synchronized 360° video and the mono audio track.
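The abstract describes a self-supervised training setup that pairs a mixed-down mono input with ground-truth spatial audio as the target. Below is a minimal sketch of how such a training pair might be assembled, assuming the spatial audio is first-order ambisonics in ACN channel order and that the mono mix is simply its omnidirectional (W) channel; the function names and data layout are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def mono_from_ambisonics(ambi: np.ndarray) -> np.ndarray:
    """Mix a first-order ambisonics clip down to mono.

    `ambi` is assumed to be shaped (4, num_samples) in ACN order (W, Y, Z, X);
    the omnidirectional W channel is used as the mono mixdown.
    """
    assert ambi.shape[0] == 4, "expected 4-channel first-order ambisonics"
    return ambi[0]  # W: omnidirectional component

def make_training_pair(ambi: np.ndarray, frames: np.ndarray):
    """Build one self-supervised example: (mono audio + 360° frames) -> ambisonics target."""
    mono = mono_from_ambisonics(ambi)
    inputs = {"audio": mono, "video": frames}   # network input
    target = ambi                               # ground-truth spatial audio as supervision
    return inputs, target
```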
Cite
Text
Morgado et al. "Self-Supervised Generation of Spatial Audio for 360° Video." Neural Information Processing Systems, 2018.
Markdown
[Morgado et al. "Self-Supervised Generation of Spatial Audio for 360° Video." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/morgado2018neurips-selfsupervised/)
BibTeX
@inproceedings{morgado2018neurips-selfsupervised,
title = {{Self-Supervised Generation of Spatial Audio for 360° Video}},
author = {Morgado, Pedro and Vasconcelos, Nuno and Langlois, Timothy and Wang, Oliver},
booktitle = {Neural Information Processing Systems},
year = {2018},
pages = {362-372},
url = {https://mlanthology.org/neurips/2018/morgado2018neurips-selfsupervised/}
}