DITTO: Diffusion Inference-Time T-Optimization for Music Generation

Abstract

We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output, and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. Compared against related training-, guidance-, and optimization-based methods, DITTO achieves state-of-the-art performance on nearly all tasks, outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door to high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://ditto-music.github.io/web/.
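
To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of inference-time noise-latent optimization with gradient checkpointing. The names sampler_step (one denoising step of a pre-trained diffusion model), feature_fn (a differentiable feature extractor, e.g. for intensity or melody), and target_features (the desired control signal) are hypothetical stand-ins, not from the paper; this illustrates the general idea under those assumptions rather than the authors' exact implementation.

import torch
from torch.utils.checkpoint import checkpoint

def ditto_optimize(sampler_step, feature_fn, target_features,
                   num_steps=50, num_opt_iters=100, lr=1e-2,
                   latent_shape=(1, 8, 256), device="cpu"):
    """Optimize the initial noise latent so the generated output matches a target."""
    # The initial noise latent x_T is the only optimization variable;
    # the pre-trained diffusion model itself is never fine-tuned.
    x_T = torch.randn(latent_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([x_T], lr=lr)

    for _ in range(num_opt_iters):
        x = x_T
        # Run the full differentiable sampling chain. Gradient checkpointing
        # recomputes each step during the backward pass instead of storing all
        # intermediate activations, trading compute for memory.
        for t in reversed(range(num_steps)):
            x = checkpoint(sampler_step, x, torch.tensor(t, device=device),
                           use_reentrant=False)
        # Any differentiable feature-matching loss works here; MSE is one choice.
        loss = torch.nn.functional.mse_loss(feature_fn(x), target_features)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return x_T.detach()

Because the loss is defined purely on features of the final output, swapping feature_fn is what yields the different applications listed above (inpainting, looping, melody control, and so on) without touching the model weights.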

Cite

Text

Novack et al. "DITTO: Diffusion Inference-Time T-Optimization for Music Generation." International Conference on Machine Learning, 2024.

Markdown

[Novack et al. "DITTO: Diffusion Inference-Time T-Optimization for Music Generation." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/novack2024icml-ditto/)

BibTeX

@inproceedings{novack2024icml-ditto,
  title     = {{DITTO: Diffusion Inference-Time T-Optimization for Music Generation}},
  author    = {Novack, Zachary and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J.},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {38426--38447},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/novack2024icml-ditto/}
}