Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?
Abstract
In this paper, we systematically study machines' multisensory perception under attacks. We use audio-visual event recognition under multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that (1) audio-visual models are susceptible to multimodal adversarial attacks; (2) audio-visual integration can decrease model robustness rather than strengthen it under multimodal attacks; (3) even a weakly-supervised sound source visual localization model can be successfully fooled; and (4) our defense method can improve the robustness of audio-visual networks without significantly sacrificing clean model performance.
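To make the three attack settings (audio only, visual only, both) concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy linear late-fusion binary classifier and uses the fast gradient sign method (FGSM) as a stand-in attack; the abstract does not name the attack, the model, or the fusion mechanism, so every name here (`fused_logit`, `fgsm_attack`, `eps`, the 8-dimensional features) is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_logit(x_a, x_v, w_a, w_v, b):
    # Toy late fusion (assumption): sum of per-modality linear scores.
    return float(x_a @ w_a + x_v @ w_v + b)

def fgsm_attack(x_a, x_v, y, w_a, w_v, b, eps,
                attack_audio=True, attack_visual=True):
    """One FGSM step on the selected modalities of the toy model."""
    p = sigmoid(fused_logit(x_a, x_v, w_a, w_v, b))
    # For binary cross-entropy, d(loss)/d(logit) = p - y; the chain rule
    # through the linear layers gives the input gradients below.
    grad_a = (p - y) * w_a
    grad_v = (p - y) * w_v
    x_a_adv = x_a + eps * np.sign(grad_a) if attack_audio else x_a.copy()
    x_v_adv = x_v + eps * np.sign(grad_v) if attack_visual else x_v.copy()
    return x_a_adv, x_v_adv

# Hypothetical weights and one clean sample that the model assigns to class 1.
rng = np.random.default_rng(0)
w_a, w_v, b = rng.normal(size=8), rng.normal(size=8), 0.0
x_a, x_v = 0.5 * np.sign(w_a), 0.5 * np.sign(w_v)
y, eps = 1.0, 0.1

clean = fused_logit(x_a, x_v, w_a, w_v, b)
logits = {}
for name, (att_a, att_v) in {"audio": (True, False),
                             "visual": (False, True),
                             "both": (True, True)}.items():
    xa_adv, xv_adv = fgsm_attack(x_a, x_v, y, w_a, w_v, b, eps, att_a, att_v)
    logits[name] = fused_logit(xa_adv, xv_adv, w_a, w_v, b)

print("clean:", clean)
for name, value in logits.items():
    print(name, "attack:", value)
```

In this toy setting, attacking both modalities lowers the true-class logit by the sum of the two single-modality drops, which illustrates why a fused model exposes a larger attack surface than either input alone. It says nothing about the magnitudes reported in the paper.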
Cite
Text
Tian and Xu. "Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00555
Markdown
[Tian and Xu. "Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/tian2021cvpr-audiovisual/) doi:10.1109/CVPR46437.2021.00555
BibTeX
@inproceedings{tian2021cvpr-audiovisual,
title = {{Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?}},
author = {Tian, Yapeng and Xu, Chenliang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {5601-5611},
doi = {10.1109/CVPR46437.2021.00555},
url = {https://mlanthology.org/cvpr/2021/tian2021cvpr-audiovisual/}
}