AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models

Bai, Jisheng; Liu, Haohe; Wang, Mou; Shi, Dongyuan; Wang, Wenwu; Plumbley, Mark D; Gan, Woon-Seng; Chen, Jianfeng

NeurIPSW 2024

Abstract

Building large-scale audio-language datasets is crucial yet challenging for training audio-language models, primarily due to its time-consuming and labour-intensive nature. Although large language models (LLMs) have greatly enhanced the efficiency of this process, current LLM-based pipelines for generating audio-text data still lack the capability to incorporate detailed audio information. In this paper, we propose a novel pipeline leveraging large audio-language models to automatically generate large-scale, fine-grained audio captions. Based on this approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs derived from recordings in AudioSet. We evaluate AudioSetCaps on two downstream tasks: audio-text retrieval and automated audio captioning. Models trained with AudioSetCaps achieve state-of-the-art performance on both tasks, demonstrating the high quality of the generated captions. Notably, our proposed data-labelling pipeline employs open-source APIs and can run on a consumer-grade GPU. To facilitate further advancements in this field, we have made our code, audio-caption paired data, and pre-trained models on downstream tasks publicly available at https://github.com/JishengBai/AudioSetCaps.

Cite

Text

Bai et al. "AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models." NeurIPS 2024 Workshops: Audio_Imagination, 2024.

Markdown

[Bai et al. "AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/bai2024neuripsw-audiosetcaps/)

BibTeX

@inproceedings{bai2024neuripsw-audiosetcaps,
  title     = {{AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models}},
  author    = {Bai, Jisheng and Liu, Haohe and Wang, Mou and Shi, Dongyuan and Wang, Wenwu and Plumbley, Mark D and Gan, Woon-Seng and Chen, Jianfeng},
  booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/bai2024neuripsw-audiosetcaps/}
}