Joint Time-Frequency and Time Domain Learning for Speech Enhancement

Abstract

For single-channel speech enhancement, time-domain and time-frequency-domain (T-F-domain) methods each have their own pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output. Such a framework takes advantage of the knowledge we have about spectrograms and avoids some of the drawbacks that T-F-domain methods have been suffering from. In TFT-Net, we design an innovative dual-path attention block (DAB) to fully exploit correlations along the time and frequency axes. We further discover that a sample-independent DAB (SDAB) achieves a good tradeoff between enhanced speech quality and complexity. Ablation studies show that both the cross-domain design and the SDAB block bring large performance gains. When logarithmic MSE is used as the training criterion, TFT-Net achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.
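The abstract names logarithmic MSE as the training criterion and SDR as an evaluation metric but gives no formulas. The following is a minimal sketch of one plausible reading, assuming the loss is the natural log of the waveform-domain mean squared error and that SDR is the plain energy ratio in decibels. The function names `log_mse` and `sdr_db` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def log_mse(clean, enhanced, eps=1e-8):
    """Log of the mean squared error between two waveforms.

    Assumed form: log(mean((clean - enhanced)^2) + eps).
    The exact definition used by TFT-Net may differ.
    """
    clean = np.asarray(clean, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    mse = np.mean((clean - enhanced) ** 2)
    return float(np.log(mse + eps))


def sdr_db(clean, enhanced, eps=1e-8):
    """Signal-to-distortion ratio in dB, using the simple energy-ratio form.

    Assumed form: 10 * log10(||clean||^2 / ||clean - enhanced||^2).
    Benchmark toolkits often use a more elaborate definition.
    """
    clean = np.asarray(clean, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    signal_power = np.sum(clean ** 2)
    error_power = np.sum((clean - enhanced) ** 2)
    return float(10.0 * np.log10((signal_power + eps) / (error_power + eps)))


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz stands in for clean speech.
    t = np.arange(16000) / 16000.0
    clean = np.sin(2.0 * np.pi * 440.0 * t)
    # A scaled copy stands in for an imperfect enhanced output.
    enhanced = 0.9 * clean
    print("log-MSE:", log_mse(clean, enhanced))
    print("SDR (dB):", sdr_db(clean, enhanced))
```

Under this reading, a lower log-MSE and a higher SDR both indicate an enhanced waveform that is closer to the clean reference.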

Cite

Text

Tang et al. "Joint Time-Frequency and Time Domain Learning for Speech Enhancement." International Joint Conference on Artificial Intelligence, 2020. doi:10.24963/IJCAI.2020/528

Markdown

[Tang et al. "Joint Time-Frequency and Time Domain Learning for Speech Enhancement." International Joint Conference on Artificial Intelligence, 2020.](https://mlanthology.org/ijcai/2020/tang2020ijcai-joint/) doi:10.24963/IJCAI.2020/528

BibTeX

@inproceedings{tang2020ijcai-joint,
  title     = {{Joint Time-Frequency and Time Domain Learning for Speech Enhancement}},
  author    = {Tang, Chuanxin and Luo, Chong and Zhao, Zhiyuan and Xie, Wenxuan and Zeng, Wenjun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2020},
  pages     = {3816-3822},
  doi       = {10.24963/IJCAI.2020/528},
  url       = {https://mlanthology.org/ijcai/2020/tang2020ijcai-joint/}
}