GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition

Abstract

In this paper, we aim to recognize materials through the combined use of auditory and visual perception. To this end, we construct a new dataset named GLAudio that contains both the geometry of the object being struck and the corresponding sound, obtained either from modal sound synthesis (for virtual objects) or from real measurements (for real objects). Besides global geometries, our dataset also takes into consideration local geometries around different hitpoints, information that is less explored in existing datasets. We demonstrate that local geometry has a greater impact on the sound than global geometry and offers more cues for material recognition. To extract features from the different modalities and fuse them properly, we propose a new deep neural network, GLAVNet, which comprises multiple branches and a well-designed fusion module. Once trained on GLAudio, our GLAVNet achieves state-of-the-art performance on material identification and supports fine-grained material categorization.

Cite

Text

Shi et al. "GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01420

Markdown

[Shi et al. "GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/shi2021cvpr-glavnet/) doi:10.1109/CVPR46437.2021.01420

BibTeX

@inproceedings{shi2021cvpr-glavnet,
  title     = {{GLAVNet: Global-Local Audio-Visual Cues for Fine-Grained Material Recognition}},
  author    = {Shi, Fengmin and Guo, Jie and Zhang, Haonan and Yang, Shan and Wang, Xiying and Guo, Yanwen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {14433--14442},
  doi       = {10.1109/CVPR46437.2021.01420},
  url       = {https://mlanthology.org/cvpr/2021/shi2021cvpr-glavnet/}
}