A Two-Stream Convolution Architecture for ESC Based on Audio Feature Distanglement
Abstract
Environmental Sound Classification (ESC) is an active area of audio classification research that has made significant progress in recent years. Current mainstream ESC methods raise the dimensionality of the extracted audio features and therefore borrow the two-dimensional convolution methods used in image processing. However, two-dimensional convolution is expensive to train, and the resulting models are usually very complex. To address these issues, we propose a novel two-stream neural network model based on the idea of disentanglement: it uses one-dimensional convolution for feature extraction, disentangling the audio features into the time domain and the frequency domain separately. Our approach strikes a good balance between computational cost and classification accuracy. It achieves accuracies of 98.51% and 97.50% on the UrbanSound8K and ESC-10 datasets, respectively, exceeding most models while also having lower model complexity.
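The following is a minimal sketch of the two-stream idea described in the abstract, not the architecture from the paper. It assumes the input is a spectrogram stored as a list of frequency-bin rows, applies one 1D convolution along the time axis and another along the frequency axis, and pools each stream into a single feature. All function names, kernels, and the toy spectrogram are hypothetical and chosen only for illustration. The helper `conv_weights` uses the standard weight count of a convolution layer (kernel size raised to the number of spatial dimensions, times input and output channels) to show why 1D kernels can be cheaper than a 2D kernel of the same width.

```python
# Illustrative sketch only: a toy two-stream 1D convolution, not the paper's model.

def conv1d_valid(signal, kernel):
    """Plain 1D cross-correlation with 'valid' padding (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def time_stream(spec, kernel):
    """Convolve each frequency bin (row) along the time axis."""
    return [conv1d_valid(row, kernel) for row in spec]

def freq_stream(spec, kernel):
    """Convolve each time frame (column) along the frequency axis."""
    columns = [list(col) for col in zip(*spec)]
    return [conv1d_valid(col, kernel) for col in columns]

def global_mean(feature_map):
    """Global average pooling over a 2D feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def two_stream_features(spec, time_kernel, freq_kernel):
    """Return one pooled feature per stream, concatenated into a list."""
    return [global_mean(time_stream(spec, time_kernel)),
            global_mean(freq_stream(spec, freq_kernel))]

def conv_weights(kernel_size, in_ch, out_ch, dims):
    """Weight count of a conv layer without bias: k**dims * in_ch * out_ch."""
    return kernel_size ** dims * in_ch * out_ch

# Hypothetical toy spectrogram: 3 frequency bins x 4 time frames.
spec = [[1.0, 2.0, 3.0, 4.0],
        [0.0, 1.0, 0.0, 1.0],
        [2.0, 2.0, 2.0, 2.0]]

features = two_stream_features(spec, [1.0, -1.0], [0.5, 0.5])

# Assumed layer sizes (64 channels, kernel width 3) for a rough comparison.
weights_2d = conv_weights(3, 64, 64, dims=2)
weights_two_1d = 2 * conv_weights(3, 64, 64, dims=1)
```

Under these assumed layer sizes, two 1D layers together hold fewer weights than a single 2D layer, which is consistent with the lower-complexity claim in the abstract. The actual savings depend on the real architecture, which this sketch does not reproduce.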
Cite
Text
Chang et al. "A Two-Stream Convolution Architecture for ESC Based on Audio Feature Distanglement." Proceedings of The 14th Asian Conference on Machine Learning, 2022.
Markdown
[Chang et al. "A Two-Stream Convolution Architecture for ESC Based on Audio Feature Distanglement." Proceedings of The 14th Asian Conference on Machine Learning, 2022.](https://mlanthology.org/acml/2022/chang2022acml-twostream/)
BibTeX
@inproceedings{chang2022acml-twostream,
title = {{A Two-Stream Convolution Architecture for ESC Based on Audio Feature Distanglement}},
author = {Chang, Zhenghao and He, Ruhan and Yu, Yongsheng and Zhang, Zili and Bai, GeLi},
booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
year = {2022},
pages = {153-168},
volume = {189},
url = {https://mlanthology.org/acml/2022/chang2022acml-twostream/}
}