IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling

Huang, Kuan-Po; Yang, Shu-Wen; Phan, Huy; Lu, Bo-Ru; Kim, Byeonggeun; Macha, Sashank; Tang, Qingming; Ghosh, Shalini; Lee, Hung-Yi; Kao, Chieh-Chi; Wang, Chao

ICML 2025 pp. 26002-26019

Abstract

Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
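
For readers unfamiliar with the term, the following is a minimal, generic sketch of iterative mask-based parallel decoding over continuous latents. It is not the authors' implementation: the cosine unmasking schedule, the random fill order, and the names `cosine_mask_count`, `iterative_parallel_decode`, and `toy_sampler` are all assumptions made for illustration. In particular, `toy_sampler` only draws Gaussian noise as a stand-in for whatever diffusion-based sampler would produce each latent frame. The point is only that several masked positions are filled per iteration, so a sequence is completed in far fewer passes than it has positions.

```python
import math
import random


def cosine_mask_count(step, total_steps, seq_len):
    """Positions that should still be masked after `step` of `total_steps`.

    Assumed cosine schedule: everything masked at step 0, nothing at the end.
    """
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return int(math.floor(seq_len * ratio))


def toy_sampler(pos, latents, rng, dim=4):
    """Hypothetical stand-in for a diffusion-based latent sampler.

    A real system would condition on the text prompt and on the latents
    already filled in; here we only draw Gaussian noise.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def iterative_parallel_decode(seq_len, total_steps, sample_latent, seed=0):
    """Fill a fully masked sequence in `total_steps` parallel iterations."""
    rng = random.Random(seed)
    latents = [None] * seq_len      # None marks a masked position
    masked = list(range(seq_len))
    rng.shuffle(masked)             # assumed: random unmasking order

    for step in range(1, total_steps + 1):
        keep_masked = cosine_mask_count(step, total_steps, seq_len)
        if step == total_steps:
            keep_masked = 0         # the final iteration fills everything left
        n_fill = len(masked) - keep_masked
        to_fill, masked = masked[:n_fill], masked[n_fill:]
        # All positions in `to_fill` belong to the same iteration; a real
        # model would generate them in one batched forward pass.
        for pos in to_fill:
            latents[pos] = sample_latent(pos, latents, rng)
    return latents


if __name__ == "__main__":
    out = iterative_parallel_decode(seq_len=16, total_steps=4, sample_latent=toy_sampler)
    print(len(out), "positions filled in 4 iterations")
```

Under this assumed schedule, few positions are filled in the early iterations and many in the later ones, when more context is available.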

Cite

Text

Huang et al. "IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Huang et al. "IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-impact/)

BibTeX

@inproceedings{huang2025icml-impact,
  title     = {{IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling}},
  author    = {Huang, Kuan-Po and Yang, Shu-Wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-Yi and Kao, Chieh-Chi and Wang, Chao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {26002-26019},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/huang2025icml-impact/}
}