Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-Based Voice Conversion

Rong, Yan; Liu, Li

doi:10.1609/AAAI.V39I23.34694

AAAI 2025 pp. 25092-25100

Abstract

Face-based Voice Conversion (FVC) is a novel task that uses facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information in the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC). Specifically, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. In addition, unlike prior work, our method accepts either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity.
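
The abstract does not spell out the IAQ-CL objective, so the following is only a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between paired face and voice embeddings, which is one common way to align two modalities. The function name `symmetric_info_nce`, the temperature value, and the embedding shapes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def symmetric_info_nce(face_emb, voice_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's IAQ-CL).

    face_emb, voice_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same speaker.
    """
    # L2-normalise so the dot product is a cosine similarity.
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    v = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)

    # Pairwise similarities; matching pairs sit on the diagonal.
    logits = (f @ v.T) / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the matching (diagonal) pair.
        return -np.mean(np.diag(log_prob))

    # Average the face-to-voice and voice-to-face directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy usage with random data (hypothetical shapes: 8 speakers, 16-dim embeddings).
rng = np.random.default_rng(0)
face = rng.normal(size=(8, 16))
voice_aligned = face + 0.01 * rng.normal(size=(8, 16))  # nearly matching pairs
voice_shuffled = np.roll(voice_aligned, shift=1, axis=0)  # deliberately mismatched pairs

loss_aligned = symmetric_info_nce(face, voice_aligned)
loss_shuffled = symmetric_info_nce(face, voice_shuffled)
```

In this toy setup the aligned pairs should yield a lower loss than the shuffled ones, which is the behaviour a contrastive objective relies on to pull a speaker's face and voice embeddings together.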

Cite

Text

Rong and Liu. "Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-Based Voice Conversion." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I23.34694

Markdown

[Rong and Liu. "Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-Based Voice Conversion." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/rong2025aaai-seeing/) doi:10.1609/AAAI.V39I23.34694

BibTeX

@inproceedings{rong2025aaai-seeing,
  title     = {{Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-Based Voice Conversion}},
  author    = {Rong, Yan and Liu, Li},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {25092--25100},
  doi       = {10.1609/AAAI.V39I23.34694},
  url       = {https://mlanthology.org/aaai/2025/rong2025aaai-seeing/}
}