A Multi-View Fusion Approach for Enhancing Speech Signals via Short-Time Fractional Fourier Transform

Abstract

Deep learning-based speech enhancement (SE) methods focus on reconstructing speech from the time or frequency domain. However, these domains cannot provide enough information to capture the dynamics of non-stationary signals accurately. To enrich information, this work proposes a multi-view fusion SE method (MFSE). Specifically, MFSE extends the representation space of speech to the dynamic domain (also called fractional domain) between the time and frequency domains by using the short-time fractional Fourier transform (STFrFT). Subsequently, we construct inputs as modes of the primary short-time Fourier transform (STFT) spectrum and the auxiliary STFrFT spectrum views and adaptively identify the optimal fractional STFrFT spectrum from the infinitely continuous fractional domain by leveraging the average spectral centroids. The framework extracts potential features through multiple designed convolutional modules and captures the correlation between different speech frequencies through multi-granularity attention. Experimental results show that the proposed method significantly improves performance in several metrics compared to existing single-channel SE methods based on time and frequency domains. Furthermore, the results of its generalizability evaluation show that the multi-view method outperforms the single-view method under a wide range of SNR conditions.
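The abstract names the average spectral centroid as the statistic used to pick the fractional view, but it does not state the selection rule. As a minimal sketch of that statistic only, the snippet below computes the average spectral centroid of an ordinary STFT magnitude spectrum with NumPy. The function names (`stft_magnitude`, `average_spectral_centroid`), the Hann window, the frame length, and the hop size are illustrative assumptions and are not taken from the paper; the STFrFT itself and the order-selection step are not implemented here.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window; shape (n_frames, frame_len // 2 + 1).

    Hypothetical helper: window type, frame length, and hop are assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def average_spectral_centroid(magnitude, sample_rate, eps=1e-12):
    """Mean over frames of the magnitude-weighted mean frequency, in Hz."""
    n_bins = magnitude.shape[1]
    # Bin centre frequencies from 0 Hz up to the Nyquist frequency.
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    centroids = (magnitude * freqs).sum(axis=1) / (magnitude.sum(axis=1) + eps)
    return float(centroids.mean())


# Toy input: pure tones, so each centroid should sit near the tone frequency.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone_low = np.sin(2 * np.pi * 1000.0 * t)
tone_high = np.sin(2 * np.pi * 3000.0 * t)

centroid_low = average_spectral_centroid(stft_magnitude(tone_low), sample_rate)
centroid_high = average_spectral_centroid(stft_magnitude(tone_high), sample_rate)
print(f"low tone: {centroid_low:.1f} Hz, high tone: {centroid_high:.1f} Hz")
```

In a multi-view setting, the same statistic could presumably be evaluated on spectra computed at several candidate fractional orders and compared across them; how MFSE actually performs that comparison is described in the paper rather than in the abstract.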

Cite

Text

Jin et al. "A Multi-View Fusion Approach for Enhancing Speech Signals via Short-Time Fractional Fourier Transform." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/613

Markdown

[Jin et al. "A Multi-View Fusion Approach for Enhancing Speech Signals via Short-Time Fractional Fourier Transform." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/jin2025ijcai-multi/) doi:10.24963/IJCAI.2025/613

BibTeX

@inproceedings{jin2025ijcai-multi,
  title     = {{A Multi-View Fusion Approach for Enhancing Speech Signals via Short-Time Fractional Fourier Transform}},
  author    = {Jin, Zikun and Qian, Yuhua and Liang, Xinyan and Geng, Haijun},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {5508-5516},
  doi       = {10.24963/IJCAI.2025/613},
  url       = {https://mlanthology.org/ijcai/2025/jin2025ijcai-multi/}
}