TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Abstract
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in 3.7 seconds on an A40 GPU. A key challenge in aligning TTA models lies in creating preference pairs, as TTA lacks structured mechanisms like the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We show that the audio preference dataset generated using CRPO outperforms the static alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. Model-generated audio samples for comparison are available at https://tangoflux.github.io/.
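The preference-pair construction that CRPO relies on can be illustrated with a minimal sketch. The abstract does not say how pairs are chosen from the ranked candidates, so pairing the highest-scoring candidate against the lowest-scoring one is an assumption here. The names `build_preference_pairs` and `toy_score` are hypothetical, and `toy_score` is a word-overlap stand-in for a CLAP text-audio similarity score, with each "audio" represented by a short text description so the example runs without any model.

```python
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    candidates: Dict[str, List[str]],
    score: Callable[[str, str], float],
) -> List[Tuple[str, str, str]]:
    """Return one (prompt, preferred, rejected) triple per prompt.

    Assumption: the top-ranked candidate is preferred and the
    bottom-ranked candidate is rejected.
    """
    pairs = []
    for prompt, audios in candidates.items():
        if len(audios) < 2:
            continue  # a pair needs at least two candidates
        ranked = sorted(audios, key=lambda a: score(prompt, a), reverse=True)
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs


def toy_score(prompt: str, audio_description: str) -> float:
    """Hypothetical stand-in for a CLAP score: fraction of prompt words matched."""
    prompt_words = set(prompt.split())
    return len(prompt_words & set(audio_description.split())) / len(prompt_words)


# Four candidate generations for one prompt, described in text for illustration.
candidates = {
    "dog barking in rain": [
        "dog barking",
        "rain",
        "dog barking in heavy rain",
        "piano",
    ]
}
pairs = build_preference_pairs(candidates, toy_score)
print(pairs)
```

In an iterative setup, pairs like these would be regenerated from the current model at each round and used for preference optimization before the next round of sampling.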
Cite
Text
Hung et al. "TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization." International Conference on Learning Representations, 2026.
Markdown
[Hung et al. "TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/hung2026iclr-tangoflux/)
BibTeX
@inproceedings{hung2026iclr-tangoflux,
title = {{TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization}},
author = {Hung, Chia-Yu and Majumder, Navonil and Kong, Zhifeng and Mehrish, Ambuj and Zadeh, Amir and Li, Chuan and Valle, Rafael and Catanzaro, Bryan and Poria, Soujanya},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/hung2026iclr-tangoflux/}
}