Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis

Abstract

This paper introduces a novel multi-task transformer for detecting synthetic speech. The network encodes the magnitude and phase of the input speech into a feature bottleneck, which is used to autoencode the input magnitude, to predict the trajectories of the first phonetic formants (F0, F1, F2), and to decide whether the input speech is synthetic or natural. The approach achieves state-of-the-art performance on the ASVspoof 2019 LA dataset with an AUC score of 0.932, while remaining interpretable.
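To illustrate the multi-task idea described above, a minimal numpy sketch is given below: a shared encoder maps magnitude and phase frames to a bottleneck embedding, from which three heads branch off (magnitude reconstruction, F0/F1/F2 trajectory regression, and utterance-level synthetic-vs-natural classification). All dimensions, layer shapes, and the single-linear-layer heads are illustrative assumptions, not the paper's actual transformer architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
N_FRAMES, N_BINS, D_BOTTLENECK = 50, 257, 64

def linear(x, w, b):
    """Affine projection x @ w + b."""
    return x @ w + b

# Shared encoder: concatenated magnitude+phase frames -> bottleneck embedding.
W_enc = rng.standard_normal((2 * N_BINS, D_BOTTLENECK)) * 0.01
b_enc = np.zeros(D_BOTTLENECK)

# Three task heads operating on the shared bottleneck:
W_dec = rng.standard_normal((D_BOTTLENECK, N_BINS)) * 0.01  # magnitude autoencoder
b_dec = np.zeros(N_BINS)
W_frm = rng.standard_normal((D_BOTTLENECK, 3)) * 0.01       # F0, F1, F2 trajectories
b_frm = np.zeros(3)
W_cls = rng.standard_normal((D_BOTTLENECK, 1)) * 0.01       # synthetic vs. natural
b_cls = np.zeros(1)

def forward(magnitude, phase):
    x = np.concatenate([magnitude, phase], axis=-1)  # (frames, 2 * bins)
    z = np.tanh(linear(x, W_enc, b_enc))             # bottleneck, (frames, d)
    mag_hat = linear(z, W_dec, b_dec)                # reconstructed magnitude
    formants = linear(z, W_frm, b_frm)               # per-frame F0, F1, F2
    logit = linear(z.mean(axis=0), W_cls, b_cls)     # pooled utterance-level logit
    p_synthetic = 1.0 / (1.0 + np.exp(-logit))       # sigmoid probability
    return mag_hat, formants, p_synthetic

mag = rng.random((N_FRAMES, N_BINS))
phs = rng.random((N_FRAMES, N_BINS))
mag_hat, formants, p = forward(mag, phs)
print(mag_hat.shape, formants.shape, p.shape)  # (50, 257) (50, 3) (1,)
```

In this setup the auxiliary heads regularize the bottleneck and make it inspectable: the reconstruction and formant outputs can be plotted per frame, which is one way the interpretability claim of the abstract can be understood.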

Cite

Text

Cuccovillo et al. "Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00444

Markdown

[Cuccovillo et al. "Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/cuccovillo2024cvprw-audio/) doi:10.1109/CVPRW63382.2024.00444

BibTeX

@inproceedings{cuccovillo2024cvprw-audio,
  title     = {{Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis}},
  author    = {Cuccovillo, Luca and Gerhardt, Milica and Aichroth, Patrick},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {4409--4417},
  doi       = {10.1109/CVPRW63382.2024.00444},
  url       = {https://mlanthology.org/cvprw/2024/cuccovillo2024cvprw-audio/}
}