Presto! Distilling Steps and Layers for Accelerating Music Generation

Abstract

Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden-state variance. Finally, we combine our step and layer distillation methods into a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32 seconds of mono/stereo 44.1 kHz audio, 15x faster than the comparable SOTA model), the fastest TTM to our knowledge.
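
The abstract notes that the layer distillation improvement works by better preserving hidden-state variance when layers are skipped. As a rough illustration only, and not the paper's exact method, the sketch below shows one hypothetical way such variance preservation could be implemented in PyTorch: when a transformer block is dropped during distillation, the residual stream is rescaled so its standard deviation matches a running estimate of what the full block would have produced. The class name `VariancePreservingSkip` and all details are assumptions for illustration.

```python
# Hypothetical sketch (not the paper's exact method): variance-preserving layer
# skipping. When a block is dropped during layer distillation, the residual
# stream is rescaled so its std matches a running estimate of the std the full
# block would produce, loosely illustrating "preserving hidden-state variance".
import torch
import torch.nn as nn


class VariancePreservingSkip(nn.Module):
    """Wraps a transformer block; optionally skips it while matching variance."""

    def __init__(self, block: nn.Module, momentum: float = 0.99):
        super().__init__()
        self.block = block
        self.momentum = momentum
        # Running estimate of the hidden-state std after the full block.
        self.register_buffer("running_out_std", torch.ones(()))

    def forward(self, x: torch.Tensor, drop: bool = False) -> torch.Tensor:
        if not drop:
            out = self.block(x)
            with torch.no_grad():
                batch_std = out.float().std()
                self.running_out_std.mul_(self.momentum).add_(
                    (1.0 - self.momentum) * batch_std
                )
            return out
        # Skip the block's computation, but rescale the hidden states so their
        # std matches the running estimate of the full block's output std.
        cur_std = x.float().std().clamp_min(1e-6)
        return x * (self.running_out_std / cur_std)


if __name__ == "__main__":
    # Toy stand-in for a transformer block (real blocks include attention/MLP).
    block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64))
    layer = VariancePreservingSkip(block)
    h = torch.randn(2, 16, 64)        # (batch, tokens, dim)
    _ = layer(h, drop=False)          # normal forward, updates running std
    h_skipped = layer(h, drop=True)   # skipped forward, variance-matched
    print(h_skipped.std().item(), layer.running_out_std.item())
```

This is only meant to make the variance-preservation idea concrete; the paper's actual layer distillation objective and schedule should be taken from the publication itself.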

Cite

Text

Novack et al. "Presto! Distilling Steps and Layers for Accelerating Music Generation." International Conference on Learning Representations, 2025.

Markdown

[Novack et al. "Presto! Distilling Steps and Layers for Accelerating Music Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/novack2025iclr-presto/)

BibTeX

@inproceedings{novack2025iclr-presto,
  title     = {{Presto! Distilling Steps and Layers for Accelerating Music Generation}},
  author    = {Novack, Zachary and Zhu, Ge and Casebeer, Jonah and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/novack2025iclr-presto/}
}