Fugatto 1: Foundational Generative Audio Transformer Opus 1

Abstract

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs. While large language models (LLMs) trained with text on a simple next-token prediction objective can learn to infer instructions directly from the data, models trained solely on audio data lack this capacity. This is because audio data does not inherently contain the instructions that were used to generate it. To overcome this challenge, we introduce a specialized dataset generation approach optimized for producing a wide range of audio generation and transformation tasks, ensuring the data reveals meaningful relationships between audio and language. Another challenge lies in achieving compositional abilities -- such as combining, interpolating between, or negating instructions -- using data alone. To address it, we propose ComposableART, an inference-time technique that extends classifier-free guidance to compositional guidance. It enables the seamless and flexible composition of instructions, leading to highly customizable audio outputs outside the training distribution. Our evaluations across a diverse set of tasks demonstrate that Fugatto performs competitively with specialized models, while ComposableART enhances its sonic palette and control over synthesis. Most notably, we highlight our framework's ability to execute emergent sounds and tasks -- sonic phenomena that transcend conventional audio generation -- unlocking new creative possibilities. Demo website: https://fugatto.github.io/
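
As a rough illustration of what extending classifier-free guidance to several instructions might look like, the sketch below combines per-instruction model outputs with signed weights. It is not taken from the paper: the function name `compose_guidance`, the plain-list representation of model outputs, and the weighted-sum rule itself are assumptions, following the commonly used composable-guidance formulation rather than ComposableART's actual definition.

```python
from typing import List, Sequence


def compose_guidance(
    uncond: Sequence[float],
    conds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Hypothetical compositional guidance step (illustrative only).

    uncond  : model output with no instruction (the unconditional branch)
    conds   : one model output per instruction
    weights : one guidance weight per instruction; a negative weight
              pushes the result away from that instruction
    """
    if len(conds) != len(weights):
        raise ValueError("need exactly one weight per conditional output")

    # Start from the unconditional output, as in classifier-free guidance.
    out = [float(u) for u in uncond]

    for cond, w in zip(conds, weights):
        if len(cond) != len(uncond):
            raise ValueError("all outputs must have the same length")
        # Add the weighted direction (conditional - unconditional).
        for i, (c, u) in enumerate(zip(cond, uncond)):
            out[i] += w * (c - u)
    return out


# Toy example: follow instruction A, steer away from instruction B.
uncond = [0.0, 0.0]
cond_a = [1.0, 0.0]
cond_b = [0.0, 2.0]
combined = compose_guidance(uncond, [cond_a, cond_b], [1.0, -0.5])
```

Under these assumptions, a single instruction with weight 1 reproduces the conditional output, fractional weights on two instructions interpolate between them, and a negative weight acts as negation.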

Cite

Text

Valle et al. "Fugatto 1: Foundational Generative Audio Transformer Opus 1." International Conference on Learning Representations, 2025.

Markdown

[Valle et al. "Fugatto 1: Foundational Generative Audio Transformer Opus 1." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/valle2025iclr-fugatto/)

BibTeX

@inproceedings{valle2025iclr-fugatto,
  title     = {{Fugatto 1: Foundational Generative Audio Transformer Opus 1}},
  author    = {Valle, Rafael and Badlani, Rohan and Kong, Zhifeng and Lee, Sang-gil and Goel, Arushi and Kim, Sungwon and Santos, Joao Felipe and Dai, Shuqi and Gururani, Siddharth and Aljafari, Aya and Liu, Alexander H. and Shih, Kevin J. and Prenger, Ryan and Ping, Wei and Yang, Chao-Han Huck and Catanzaro, Bryan},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/valle2025iclr-fugatto/}
}