Visual to Sound: Generating Natural Sound for Videos in the Wild
Abstract
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Specifically, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
Cite
Text
Zhou et al. "Visual to Sound: Generating Natural Sound for Videos in the Wild." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.
Markdown
[Zhou et al. "Visual to Sound: Generating Natural Sound for Videos in the Wild." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/zhou2018cvprw-visual/)
BibTeX
@inproceedings{zhou2018cvprw-visual,
title = {{Visual to Sound: Generating Natural Sound for Videos in the Wild}},
author = {Zhou, Yipin and Wang, Zhaowen and Fang, Chen and Bui, Trung and Berg, Tamara L.},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2018},
pages = {2500--2503},
url = {https://mlanthology.org/cvprw/2018/zhou2018cvprw-visual/}
}