Audio Generation with Multiple Conditional Diffusion Model

Guo, Zhifang; Mao, Jianguo; Tao, Rui; Yan, Long; Ouchi, Kazushige; Liu, Hong; Wang, Xiangdong

doi:10.1609/AAAI.V38I16.29773

AAAI 2024 pp. 18153-18161

Abstract

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation.

Cite

Text

Guo et al. "Audio Generation with Multiple Conditional Diffusion Model." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I16.29773

Markdown

[Guo et al. "Audio Generation with Multiple Conditional Diffusion Model." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/guo2024aaai-audio/) doi:10.1609/AAAI.V38I16.29773

BibTeX

@inproceedings{guo2024aaai-audio,
  title     = {{Audio Generation with Multiple Conditional Diffusion Model}},
  author    = {Guo, Zhifang and Mao, Jianguo and Tao, Rui and Yan, Long and Ouchi, Kazushige and Liu, Hong and Wang, Xiangdong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {18153--18161},
  doi       = {10.1609/AAAI.V38I16.29773},
  url       = {https://mlanthology.org/aaai/2024/guo2024aaai-audio/}
}