MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
Abstract
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To address this, we adopt a preference alignment perspective, analogous to aligning LLMs with human intent, and introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized with a stable, clipped trust-region surrogate. The reward, derived from a progressively aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in both signal metrics and semantic quality. Our code is available at https://github.com/mars-sep/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
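To make the idea of a factorized Beta mask policy with a clipped surrogate more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the parameterization, reward, or baseline, so the PPO-style clipping, the function names (`beta_log_prob`, `sample_mask`, `mask_log_prob`, `clipped_surrogate`), the value `clip_eps=0.2`, and the toy reward numbers are all assumptions made for illustration.

```python
import math
import random


def beta_log_prob(m, alpha, beta):
    """Log-density of Beta(alpha, beta) at a mask value m in (0, 1)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * math.log(m) + (beta - 1.0) * math.log(1.0 - m) - log_norm


def sample_mask(alphas, betas, rng):
    """Draw one mask value per time-frequency bin, independently of the others."""
    eps = 1e-6  # keep samples strictly inside (0, 1) so the log-density is finite
    return [min(max(rng.betavariate(a, b), eps), 1.0 - eps)
            for a, b in zip(alphas, betas)]


def mask_log_prob(mask, alphas, betas):
    """Factorized policy: the joint log-probability is a sum over bins."""
    return sum(beta_log_prob(m, a, b) for m, a, b in zip(mask, alphas, betas))


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective (assumed form of the trust-region surrogate)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Toy usage with four hypothetical time-frequency bins.
rng = random.Random(0)
old_alphas, old_betas = [2.0] * 4, [2.0] * 4   # behaviour policy parameters
new_alphas, new_betas = [2.5] * 4, [2.0] * 4   # updated policy parameters

mask = sample_mask(old_alphas, old_betas, rng)
logp_old = mask_log_prob(mask, old_alphas, old_betas)
logp_new = mask_log_prob(mask, new_alphas, new_betas)

reward, baseline = 0.8, 0.5      # placeholder values; a reward model would supply these
objective = clipped_surrogate(logp_new, logp_old, reward - baseline)
```

In a real system the Beta parameters would be predicted by the separation network for every bin, the reward would come from the multimodal encoder's agreement with the query, and the objective would be maximized by gradient ascent; none of those parts are modeled here.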
Cite
Text
Zhang et al. "MARS-Sep: Multimodal-Aligned Reinforced Sound Separation." International Conference on Learning Representations, 2026.
Markdown
[Zhang et al. "MARS-Sep: Multimodal-Aligned Reinforced Sound Separation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-marssep/)
BibTeX
@inproceedings{zhang2026iclr-marssep,
title = {{MARS-Sep: Multimodal-Aligned Reinforced Sound Separation}},
author = {Zhang, Zihan and Cheng, Xize and Jiang, Zhennan and Fu, Dongjie and Chen, Jingyuan and Zhao, Zhou and Jin, Tao},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/zhang2026iclr-marssep/}
}