Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Abstract

We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. SFM inference starts from the intermediate state rather than from pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. A demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
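The idea of starting inference partway along the FM path can be illustrated with a toy sketch. This is not the authors' implementation: the linear path x_t = (1 - t) * noise + t * target, the fixed-step Euler solver, the closed-form toy velocity field, and the names `euler_integrate`, `shallow_start`, and `toy_velocity` are all assumptions made for illustration. In a real TTS model the velocity would come from a trained network, and the start time would be predicted rather than fixed by hand.

```python
import numpy as np

def euler_integrate(velocity, x, t_start, t_end=1.0, n_steps=8):
    """Fixed-step Euler integration of dx/dt = velocity(x, t) from t_start to t_end."""
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x

def shallow_start(coarse, t_start, rng):
    """Hypothetical intermediate state: place the coarse output on an assumed
    linear path at time t_start and fill the remainder with Gaussian noise."""
    noise = rng.standard_normal(coarse.shape)
    return t_start * coarse + (1.0 - t_start) * noise

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])             # stands in for a clean acoustic frame
coarse = target + 0.1 * rng.standard_normal(3)  # stands in for the weak generator output

def toy_velocity(x, t):
    # Closed-form velocity of the assumed linear path toward a known target.
    # It replaces a learned network purely to keep the sketch runnable.
    return (target - x) / (1.0 - t)

# Conventional inference: start from pure noise at t = 0 and integrate the whole path.
x_full = euler_integrate(toy_velocity, rng.standard_normal(3), t_start=0.0, n_steps=10)

# Shallow inference: start from the intermediate state at t = 0.7 and
# integrate only the remaining 30% of the path with fewer steps.
x_shallow = euler_integrate(toy_velocity, shallow_start(coarse, 0.7, rng),
                            t_start=0.7, n_steps=3)
```

In this toy setting both runs reach the same target, but the shallow run covers a shorter time interval and takes fewer solver steps. The abstract reports its speedup for adaptive-step ODE solvers, which this fixed-step sketch does not reproduce.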

Cite

Text

Yang et al. "Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis." Advances in Neural Information Processing Systems, 2025.

Markdown

[Yang et al. "Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yang2025neurips-shallow/)

BibTeX

@inproceedings{yang2025neurips-shallow,
  title     = {{Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis}},
  author    = {Yang, Dong and Cai, Yiyi and Saito, Yuki and Wang, Lixu and Saruwatari, Hiroshi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yang2025neurips-shallow/}
}