Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

NeurIPS 2025

Abstract

We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer denoiser, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
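For context on the baseline the abstract contrasts against, the following is a minimal sketch of two standard ingredients: a linear-path flow-matching training pair, and the classifier-free guidance rule that mixes conditional and unconditional velocity predictions at sampling time. It is not MGAudio's model-guided objective, which the abstract does not specify; the function names, array shapes, and the linear interpolation path are assumptions chosen for illustration only.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Build a training pair for flow matching on a linear path (assumed here).

    x0: noise samples, shape (batch, dim)
    x1: data samples (e.g. audio latents), same shape as x0
    t:  per-example times in [0, 1], shape (batch,)
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    # Reshape t so it broadcasts over all non-batch dimensions.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target


def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `scale`."""
    return v_uncond + scale * (v_cond - v_uncond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 8))   # noise
    x1 = rng.standard_normal((4, 8))   # stand-in for audio latents
    t = rng.uniform(size=4)
    x_t, v_target = flow_matching_pair(x0, x1, t)

    # Stand-ins for two forward passes of a hypothetical denoiser,
    # one with the video condition and one without it.
    v_cond = v_target + 0.1
    v_uncond = v_target - 0.1
    guided = cfg_velocity(v_cond, v_uncond, scale=2.0)
    print(x_t.shape, guided.shape)
```

Classifier-free guidance of this form needs both a conditional and an unconditional prediction at every sampling step. The abstract's stated alternative is to have the model guide itself through its training objective instead.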

Cite

Text

Zhang et al. "Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-modelguided/)

BibTeX

@inproceedings{zhang2025neurips-modelguided,
  title     = {{Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation}},
  author    = {Zhang, Kang and Pham, Trung X. and Lee, Suyeon and Niu, Axi and Senocak, Arda and Chung, Joon Son},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-modelguided/}
}