Visually Indicated Sound Generation by Perceptually Optimized Classification

Chen, Kan; Zhang, Chuanxi; Fang, Chen; Wang, Zhaowen; Bui, Trung; Nevatia, Ram

doi:10.1007/978-3-030-11024-6_43

ECCVW 2018 pp. 560-574

Abstract

Visually indicated sound generation aims to predict sound that is consistent with the visual content of a video. Previous methods addressed this problem with a single generative model that ignores the distinctive characteristics of different sound categories. State-of-the-art sound classification networks can now capture semantic-level information in the audio modality, and this information can also serve visually indicated sound generation. In this paper, we explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve audio generation quality. We propose a novel Perceptually Optimized Classification based Audio generation Network (POCAN), which generates sound conditioned on the sound class predicted from visual information. Additionally, a perceptual loss is calculated via a pre-trained sound classification network to align the semantic information between the generated sound and its ground truth during training. Experiments show that POCAN achieves significantly better results on the visually indicated sound generation task on two datasets.
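
The abstract does not spell out the loss, so the following is only a minimal sketch of the general idea of a perceptual loss computed through a frozen classifier, not the authors' implementation. Everything here is an assumption made for illustration: the spectrogram shape, the mean-squared-error terms, the `perceptual_weight` parameter, and the random projection `W_FROZEN`, which merely stands in for a real pre-trained sound classification network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen, pre-trained sound classifier:
# a fixed random projection followed by ReLU. A real system would use
# intermediate features from an actual pre-trained audio network.
N_BINS, N_FEATS = 64, 16
W_FROZEN = rng.standard_normal((N_BINS, N_FEATS))


def classifier_features(spec):
    """Map a (frames, N_BINS) spectrogram to (frames, N_FEATS) features."""
    return np.maximum(spec @ W_FROZEN, 0.0)


def total_loss(generated, target, perceptual_weight=1.0):
    """Reconstruction loss plus perceptual loss, both mean squared error.

    The perceptual term compares classifier features of the generated
    sound with those of the ground truth (assumed form, not from the paper).
    """
    reconstruction = np.mean((generated - target) ** 2)
    perceptual = np.mean(
        (classifier_features(generated) - classifier_features(target)) ** 2
    )
    return reconstruction + perceptual_weight * perceptual


# Toy data: a ground-truth spectrogram and a slightly perturbed "generation".
target = rng.standard_normal((10, N_BINS))
generated = target + 0.1 * rng.standard_normal((10, N_BINS))

print(total_loss(generated, target))  # positive: the two spectrograms differ
print(total_loss(target, target))     # 0.0: identical inputs give zero loss
```

In this sketch the classifier weights are never updated, so only the generator would receive gradients from the perceptual term in a real training setup.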

Cite

Text

Chen et al. "Visually Indicated Sound Generation by Perceptually Optimized Classification." European Conference on Computer Vision Workshops, 2018. doi:10.1007/978-3-030-11024-6_43

Markdown

[Chen et al. "Visually Indicated Sound Generation by Perceptually Optimized Classification." European Conference on Computer Vision Workshops, 2018.](https://mlanthology.org/eccvw/2018/chen2018eccvw-visually/) doi:10.1007/978-3-030-11024-6_43

BibTeX

@inproceedings{chen2018eccvw-visually,
  title     = {{Visually Indicated Sound Generation by Perceptually Optimized Classification}},
  author    = {Chen, Kan and Zhang, Chuanxi and Fang, Chen and Wang, Zhaowen and Bui, Trung and Nevatia, Ram},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2018},
  pages     = {560--574},
  doi       = {10.1007/978-3-030-11024-6_43},
  url       = {https://mlanthology.org/eccvw/2018/chen2018eccvw-visually/}
}