MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Abstract

We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories.
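To make the inpainting formulation concrete, below is a minimal sketch of query-driven source extraction as latent inpainting. It assumes an epsilon-prediction denoiser with a DDIM-style deterministic sampler, a joint latent that stacks the mixture, submixture, and stem tracks along the channel axis, and a `denoiser(z_t, t, text_emb)` call signature; these are illustrative assumptions, not the authors' released API or their exact sampler.

```python
import torch

def ddim_inpaint_extract(denoiser, z_mix, text_emb, T=50, device="cpu"):
    """Query-driven extraction as latent inpainting (illustrative only).

    Assumptions (not from the paper's code):
      - the joint latent stacks [mixture, submixture, stem] along dim=1,
      - `denoiser(z_t, t, text_emb)` predicts the added noise (epsilon),
      - `z_mix` is the pre-encoded mixture latent of shape (B, C, L).
    """
    B, C, L = z_mix.shape
    # Toy alpha-bar schedule, increasing from very noisy to nearly clean.
    abar = torch.linspace(1e-3, 0.999, T, device=device)

    # Start the full joint latent (3 tracks: mix, submix, stem) from pure noise.
    z_t = torch.randn(B, 3 * C, L, device=device)

    for i in range(T):
        t = torch.full((B,), i, device=device, dtype=torch.long)

        # Clamp the known (mixture) slice to a correspondingly-noised copy of
        # the clean mixture latent, as in standard diffusion inpainting.
        noise = torch.randn_like(z_mix)
        z_t[:, :C] = abar[i].sqrt() * z_mix + (1 - abar[i]).sqrt() * noise

        eps = denoiser(z_t, t, text_emb)                       # predict noise
        z0_hat = (z_t - (1 - abar[i]).sqrt() * eps) / abar[i].sqrt()

        if i < T - 1:
            # Deterministic DDIM step to the next (less noisy) level.
            z_t = abar[i + 1].sqrt() * z0_hat + (1 - abar[i + 1]).sqrt() * eps
        else:
            z_t = z0_hat

    # The last C channels hold the extracted stem latent; decode it separately.
    return z_t[:, -C:]
```

Partial generation (source imputation) follows the same pattern: the slices corresponding to the provided tracks are clamped to their noised clean latents at every step, while the remaining slices are sampled freely under the text condition.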

Cite

Text

Chae and Lee. "MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction." Advances in Neural Information Processing Systems, 2025.

Markdown

[Chae and Lee. "MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chae2025neurips-mgeldm/)

BibTeX

@inproceedings{chae2025neurips-mgeldm,
  title     = {{MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction}},
  author    = {Chae, Yunkee and Lee, Kyogu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/chae2025neurips-mgeldm/}
}