Video-Guided Foley Sound Generation with Multimodal Controls
Abstract
Generating sound effects for videos often requires creating artistic sounds that diverge significantly from real-life sources, as well as flexible control over the sound design. To address this problem, we introduce *MultiFoley*, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, *MultiFoley* allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). *MultiFoley* also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48 kHz) audio generation. Through automated evaluations and human studies, we demonstrate that *MultiFoley* successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods.
Cite
Text
Chen et al. "Video-Guided Foley Sound Generation with Multimodal Controls." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01749
Markdown
[Chen et al. "Video-Guided Foley Sound Generation with Multimodal Controls." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/chen2025cvpr-videoguided/) doi:10.1109/CVPR52734.2025.01749
BibTeX
@inproceedings{chen2025cvpr-videoguided,
title = {{Video-Guided Foley Sound Generation with Multimodal Controls}},
author = {Chen, Ziyang and Seetharaman, Prem and Russell, Bryan and Nieto, Oriol and Bourgin, David and Owens, Andrew and Salamon, Justin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {18770-18781},
doi = {10.1109/CVPR52734.2025.01749},
url = {https://mlanthology.org/cvpr/2025/chen2025cvpr-videoguided/}
}