Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Mu, Zhaoxi; Yang, Xinyu

doi:10.24963/ijcai.2024/709

IJCAI 2024 pp. 6415-6423

Abstract

Linear Discriminant Analysis (LDA) is a classical supervised dimensionality reduction algorithm. However, LDA focuses mainly on global structure and depends heavily on reliable data labels. For data with outliers and nonlinear structures, LDA cannot effectively capture the true structure of the data. Moreover, the dimension of the subspace learned by LDA must be smaller than the number of clusters, which limits its practical applications. To address these issues, we propose a novel unsupervised LDA method that combines centerless K-means and LDA. This method eliminates the need to calculate cluster centroids and improves model robustness. By fusing centerless K-means and LDA into a unified framework and deducing the connection between K-means and manifold learning, this method captures both the local manifold structure and the discriminative structure. Additionally, the dimensionality of the subspace is not restricted. This method not only overcomes the limitations of traditional LDA but also improves the model’s adaptability to complex data. Extensive experiments on seven datasets demonstrate the effectiveness of the proposed method.
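
The abstract mentions a K-means formulation that avoids computing cluster centroids. As a rough illustration of why that is possible (not the authors' method, whose details are not given here), the sketch below checks a standard identity: within a cluster of n points, the sum of squared distances to the centroid equals the sum of squared pairwise distances divided by 2n. The function names `centroid_cost` and `centerless_cost` and the toy data are assumptions made for this example only.

```python
# Sketch under stated assumptions: compares the usual K-means cost with an
# equivalent pairwise-distance ("centerless") form. Names are hypothetical.

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters

def centroid_cost(points, labels):
    """Usual K-means cost: squared distances of points to their centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        n = len(members)
        dim = len(members[0])
        mu = [sum(p[d] for p in members) / n for d in range(dim)]
        total += sum(sq_dist(p, mu) for p in members)
    return total

def centerless_cost(points, labels):
    """Same cost from pairwise distances only; no centroid is computed."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        n = len(members)
        # Sum over all ordered pairs (each unordered pair is counted twice).
        pair_sum = sum(sq_dist(p, q) for p in members for q in members)
        total += pair_sum / (2 * n)
    return total

# Toy data chosen for this example only.
points = [(0, 0), (2, 0), (0, 2), (10, 10), (12, 10)]
labels = [0, 0, 0, 1, 1]
print(centroid_cost(points, labels), centerless_cost(points, labels))
```

Both functions return the same value up to floating-point rounding, which is why the clustering objective can be written without explicit centroids.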

Cite

Text

Mu and Yang. "Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/709

Markdown

[Mu and Yang. "Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/mu2024ijcai-separate/) doi:10.24963/ijcai.2024/709

BibTeX

@inproceedings{mu2024ijcai-separate,
  title     = {{Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction}},
  author    = {Mu, Zhaoxi and Yang, Xinyu},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {6415--6423},
  doi       = {10.24963/ijcai.2024/709},
  url       = {https://mlanthology.org/ijcai/2024/mu2024ijcai-separate/}
}