On Learning Association of Sound Source and Visual Scenes

Abstract

Sight (vision) and hearing (audition) are the most important senses that humans use to understand their surroundings. Visual events are typically accompanied by sounds, and the two modalities are naturally combined; videos and their corresponding sounds likewise arrive together in a synchronized way. Given plenty of paired video and sound clips, can a machine learn, without any supervision, to associate sounds with visual scenes and thereby localize sound sources, much as human perception does? In this paper, we explore whether computational models can learn the spatial correspondence between visual and audio information by leveraging the correlation between visuals and sound, based simply on watching and listening to videos in an unsupervised way.

Cite

Text

Senocak et al. "On Learning Association of Sound Source and Visual Scenes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.

Markdown

[Senocak et al. "On Learning Association of Sound Source and Visual Scenes." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.](https://mlanthology.org/cvprw/2018/senocak2018cvprw-learning/)

BibTeX

@inproceedings{senocak2018cvprw-learning,
  title     = {{On Learning Association of Sound Source and Visual Scenes}},
  author    = {Senocak, Arda and Oh, Tae-Hyun and Kim, Junsik and Yang, Ming-Hsuan and Kweon, In So},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2018},
  pages     = {2508--2509},
  url       = {https://mlanthology.org/cvprw/2018/senocak2018cvprw-learning/}
}