IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling
Abstract
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
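To make the general idea of iterative mask-based parallel decoding concrete, here is a minimal toy sketch. It is not the authors' implementation: the cosine masking schedule, the random choice of which positions to fill, and the names `cosine_mask_count`, `sample_latent`, and `parallel_decode` are all assumptions made for illustration. The `sample_latent` stub draws Gaussian noise where a real system would run a text-conditioned sampler over continuous latents.

```python
import math
import random


def cosine_mask_count(step, num_steps, seq_len):
    """Positions that stay masked after `step` of `num_steps` (assumed cosine schedule)."""
    ratio = math.cos(math.pi / 2 * step / num_steps)
    return int(math.floor(seq_len * ratio))


def sample_latent(position, context, rng, dim=4):
    """Hypothetical stand-in for a per-position continuous sampler.

    A real model would condition on the text prompt and on `context`;
    this stub only draws Gaussian noise so that the loop is runnable.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def parallel_decode(seq_len=16, num_steps=4, seed=0):
    rng = random.Random(seed)
    latents = [None] * seq_len  # None marks a masked position
    filled_per_step = []
    for step in range(1, num_steps + 1):
        masked = [i for i, z in enumerate(latents) if z is None]
        keep_masked = cosine_mask_count(step, num_steps, seq_len)
        num_to_fill = len(masked) - keep_masked
        # Assumption: positions are picked uniformly at random.
        chosen = rng.sample(masked, num_to_fill)
        # Every chosen position sees the same snapshot of the sequence,
        # which is what makes the step parallel rather than autoregressive.
        context = list(latents)
        for i in chosen:
            latents[i] = sample_latent(i, context, rng)
        filled_per_step.append(num_to_fill)
    return latents, filled_per_step


if __name__ == "__main__":
    latents, filled_per_step = parallel_decode()
    print(filled_per_step)
```

The point of the sketch is the loop structure: the number of decoding passes is set by `num_steps` rather than by the sequence length, which is where the latency advantage over position-by-position decoding would come from.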
Cite
Text
Huang et al. "IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Huang et al. "IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-impact/)
BibTeX
@inproceedings{huang2025icml-impact,
title = {{IMPACT: Iterative Mask-Based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling}},
author = {Huang, Kuan-Po and Yang, Shu-Wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-Yi and Kao, Chieh-Chi and Wang, Chao},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {26002-26019},
volume = {267},
url = {https://mlanthology.org/icml/2025/huang2025icml-impact/}
}