Neural Synthesis of Binaural Speech from Mono Audio

Abstract

We present a neural rendering approach for binaural sound synthesis that can produce realistic and spatially accurate binaural sound in real time. The network takes, as input, a single-channel audio source and synthesizes, as output, two-channel binaural sound, conditioned on the relative position and orientation of the listener with respect to the source. We investigate deficiencies of the l2-loss on raw waveforms in a theoretical analysis and introduce an improved loss that overcomes these limitations. In an empirical evaluation, we establish that our approach is the first to generate spatially accurate waveform outputs (as measured against real recordings) and outperforms existing approaches by a considerable margin, both quantitatively and in a perceptual study. The dataset and code are available online.
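To make the loss discussion concrete, below is a minimal sketch (in PyTorch, which the abstract does not prescribe) of one way to combine an l2 amplitude term with an amplitude-weighted phase term on the STFT. The paper's exact loss formulation differs, so treat this as illustrative only; the function name phase_aware_loss and the hyperparameters n_fft, hop, and lam are hypothetical.

import torch

def phase_aware_loss(pred, target, n_fft=512, hop=128, lam=0.1):
    # pred, target: (batch, samples) waveforms; hypothetical hyperparameters.
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop_length=hop, window=window, return_complex=True)
    T = torch.stft(target, n_fft, hop_length=hop, window=window, return_complex=True)
    # l2 on spectral amplitudes alone is blind to phase, which is exactly the
    # component a plain waveform l2-loss handles poorly.
    amp_loss = (P.abs() - T.abs()).pow(2).mean()
    # Amplitude-weighted angular phase distance: phase errors are only
    # penalized where the target signal actually carries energy.
    phase_loss = (T.abs() * (1.0 - torch.cos(P.angle() - T.angle()))).mean()
    return amp_loss + lam * phase_loss

The amplitude weighting on the phase term reflects the intuition that phase is perceptually meaningless in near-silent time-frequency bins, so a useful loss should not spend gradient there.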

Cite

Text

Richard et al. "Neural Synthesis of Binaural Speech from Mono Audio." International Conference on Learning Representations, 2021.

Markdown

[Richard et al. "Neural Synthesis of Binaural Speech from Mono Audio." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/richard2021iclr-neural/)

BibTeX

@inproceedings{richard2021iclr-neural,
  title     = {{Neural Synthesis of Binaural Speech from Mono Audio}},
  author    = {Richard, Alexander and Markovic, Dejan and Gebru, Israel D. and Krenn, Steven and Butler, Gladstone Alexander and De la Torre, Fernando and Sheikh, Yaser},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/richard2021iclr-neural/}
}