UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Abstract
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks, which is an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
Cite
Text
Tian et al. "UALM: Unified Audio Language Model for Understanding, Generation and Reasoning." International Conference on Learning Representations, 2026.
Markdown
[Tian et al. "UALM: Unified Audio Language Model for Understanding, Generation and Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tian2026iclr-ualm/)
BibTeX
@inproceedings{tian2026iclr-ualm,
title = {{UALM: Unified Audio Language Model for Understanding, Generation and Reasoning}},
author = {Tian, Jinchuan and Lee, Sang-gil and Kong, Zhifeng and Ghosh, Sreyan and Goel, Arushi and Yang, Chao-Han Huck and Dai, Wenliang and Liu, Zihan and Ye, Hanrong and Watanabe, Shinji and Shoeybi, Mohammad and Catanzaro, Bryan and Valle, Rafael and Ping, Wei},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/tian2026iclr-ualm/}
}