BLAP: Bootstrapping Language-Audio Pre-Training for Music Captioning

Abstract

We introduce BLAP, a model capable of generating high-quality captions for music. BLAP is based on the BLIP-2 architecture, leveraging a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP utilizes a Querying Transformer, allowing us to obtain state-of-the-art performance using 6x less data than previous models. We provide qualitative examples demonstrating BLAP's ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves a relative improvement in FENSE over previous models of 3.5%, 6.5%, and 7.5% on the MusicCaps, Song Describer, and YouTube8m-MTC datasets, respectively. We open-source the code and model weights at https://github.com/ETH-DISCO/blap.
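
To illustrate how the pieces described in the abstract fit together, below is a minimal PyTorch sketch of the bridging idea: learned query tokens cross-attend to frozen CLAP audio features and are projected into the language model's embedding space as a soft prompt for Flan-T5. All names, dimensions, and the use of nn.TransformerDecoder as a stand-in Querying Transformer are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch only: a Q-Former-style bridge between a frozen audio
# encoder (e.g. CLAP features) and a text decoder (e.g. Flan-T5).
# Dimensions and module choices are hypothetical.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    """Learned query tokens attend to audio features and are mapped
    into the language model's embedding width."""

    def __init__(self, num_queries=32, audio_dim=512, hidden_dim=768, lm_dim=1024):
        super().__init__()
        # Learnable query tokens shared across all inputs.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)   # project CLAP-style features
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_proj = nn.Linear(hidden_dim, lm_dim)          # map to LM embedding width

    def forward(self, audio_feats):                           # (B, T, audio_dim)
        memory = self.audio_proj(audio_feats)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        out = self.qformer(tgt=q, memory=memory)              # queries cross-attend to audio
        return self.lm_proj(out)                              # (B, num_queries, lm_dim)


if __name__ == "__main__":
    # Stand-in for precomputed audio embeddings: 2 clips, 10 frames, 512 dims.
    audio_feats = torch.randn(2, 10, 512)
    bridge = QFormerBridge()
    prefix = bridge(audio_feats)
    print(prefix.shape)  # torch.Size([2, 32, 1024]); fed to the LM as prefix embeddings
```

In this sketch, the resulting prefix embeddings would be concatenated with (or passed as) the language model's input embeddings so the caption is generated conditioned on the audio; the actual training recipe and module details are described in the paper and repository.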

Cite

Text

Lanzendörfer et al. "BLAP: Bootstrapping Language-Audio Pre-Training for Music Captioning." NeurIPS 2024 Workshops: Audio_Imagination, 2024.

Markdown

[Lanzendörfer et al. "BLAP: Bootstrapping Language-Audio Pre-Training for Music Captioning." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/lanzendorfer2024neuripsw-blap/)

BibTeX

@inproceedings{lanzendorfer2024neuripsw-blap,
  title     = {{BLAP: Bootstrapping Language-Audio Pre-Training for Music Captioning}},
  author    = {Lanzendörfer, Luca A and Pinkl, Constantin and Perraudin, Nathanaël and Wattenhofer, Roger},
  booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/lanzendorfer2024neuripsw-blap/}
}