ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering

Song, Yakun; Chen, Zhuo; Wang, Xiaofei; Ma, Ziyang; Chen, Xie

doi:10.1609/AAAI.V39I24.34703

AAAI 2025 pp. 25174-25182

Abstract

The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in zero-shot audio generation. However, existing methods still have several limitations: 1) repetitions, transpositions, and omissions in the synthesized speech, due to limited alignment constraints between audio and phoneme tokens; 2) difficulty in exercising fine-grained control over the synthesized speech with an autoregressive (AR) language model; 3) infinite silence generation, which stems from the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework that enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. Experiments show that our model outperforms baselines in accuracy and delivers more stable results under both greedy and sampling-based decoding strategies.
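The interleaving idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `interleave`, the representation of the alignment as a per-phoneme frame count, and the use of plain strings and integers as tokens are all assumptions made for illustration. The abstract states only that each phoneme token appears ahead of its corresponding acoustic tokens.

```python
from typing import List, Sequence, Union

Token = Union[str, int]


def interleave(
    phonemes: Sequence[str],
    durations: Sequence[int],
    acoustic: Sequence[int],
) -> List[Token]:
    """Place each phoneme token ahead of the acoustic tokens aligned to it.

    Assumption (hypothetical format): durations[i] is the number of
    acoustic frames that a forced aligner assigned to phonemes[i].
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if sum(durations) != len(acoustic):
        raise ValueError("durations must cover every acoustic token")

    sequence: List[Token] = []
    pos = 0
    for phoneme, n_frames in zip(phonemes, durations):
        sequence.append(phoneme)                         # phoneme token first
        sequence.extend(acoustic[pos:pos + n_frames])    # then its acoustic tokens
        pos += n_frames
    return sequence


if __name__ == "__main__":
    # Two phonemes aligned to 2 and 3 acoustic frames respectively.
    print(interleave(["HH", "AY"], [2, 3], [11, 12, 13, 14, 15]))
    # → ['HH', 11, 12, 'AY', 13, 14, 15]
```

In a sequence ordered this way, every acoustic token is preceded by the phoneme it belongs to, which gives an intuition for how such an ordering could support phoneme-level control during autoregressive decoding.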

Cite

Text

Song et al. "ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I24.34703

Markdown

[Song et al. "ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/song2025aaai-ella/) doi:10.1609/AAAI.V39I24.34703

BibTeX

@inproceedings{song2025aaai-ella,
  title     = {{ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering}},
  author    = {Song, Yakun and Chen, Zhuo and Wang, Xiaofei and Ma, Ziyang and Chen, Xie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {25174-25182},
  doi       = {10.1609/AAAI.V39I24.34703},
  url       = {https://mlanthology.org/aaai/2025/song2025aaai-ella/}
}