VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching

Wang, Xihua; Cheng, Xin; Wang, Yuyue; Song, Ruihua; Wang, Yunfeng

ICCV 2025 pp. 11777-11786

Abstract

Video-to-audio (V2A) generation aims to synthesize temporally aligned, realistic sounds for silent videos, a critical capability for immersive multimedia applications. Current V2A methods, predominantly based on diffusion or flow models, rely on suboptimal noise-to-audio paradigms that entangle cross-modal mappings with stochastic priors, resulting in inefficient training and convoluted transport paths. We propose VAFlow, a novel flow-based framework that directly models the video-to-audio transformation, eliminating reliance on noise priors. To address modality discrepancies, we employ an alignment variational autoencoder that compresses heterogeneous video features into audio-aligned latent spaces while preserving spatiotemporal semantics. By retaining cross-attention mechanisms between video features and flow blocks, our architecture enables classifier-free guidance within video source-driven generation. Without external data or complex training tricks, VAFlow achieves state-of-the-art performance on the VGGSound benchmark, surpassing even text-augmented models in audio fidelity, diversity, and distribution alignment. This work establishes a new paradigm for V2A generation with a direct and effective video-to-audio transformation via flow matching.
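
To make the idea of a video-sourced flow concrete, the following is a minimal sketch of generic flow matching in which the source sample is a video latent rather than Gaussian noise. It is not the authors' implementation: the straight-line path between source and target, the Euler sampler, the array shapes, and every function name (`interpolate`, `flow_matching_loss`, `guided_velocity`, `euler_sample`) are assumptions chosen for illustration, and the abstract does not state which path or solver VAFlow uses.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on an ASSUMED straight path from source x0 to target x1.

    x0: video latents (source), x1: audio latents (target), same shape.
    t:  per-sample times in [0, 1], shape (batch,).
    """
    t = np.reshape(t, (-1,) + (1,) * (x0.ndim - 1))  # broadcast over latent dims
    return (1.0 - t) * x0 + t * x1


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity.

    For the straight path above, the target velocity is the constant x1 - x0.
    velocity_fn is a hypothetical stand-in for a learned network.
    """
    xt = interpolate(x0, x1, t)
    target = x1 - x0
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


def guided_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance mix of two velocity predictions."""
    return v_uncond + scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1, starting at the video latent."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_latent = rng.normal(size=(4, 8, 16))  # hypothetical (batch, time, channels)
    audio_latent = rng.normal(size=(4, 8, 16))
    # An oracle velocity field, used only to check the plumbing.
    oracle = lambda x, t: audio_latent - video_latent
    t = rng.uniform(size=4)
    print(flow_matching_loss(oracle, video_latent, audio_latent, t))
    print(np.allclose(euler_sample(oracle, video_latent), audio_latent))
```

In this sketch the only difference from a noise-to-audio setup is what `x0` holds: a video latent, which the abstract says comes from an alignment variational autoencoder, instead of a noise sample. In a real model the conditional and unconditional velocities passed to `guided_velocity` would come from running the network with and without the cross-attended video features.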

Cite

Text

Wang et al. "VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching." International Conference on Computer Vision, 2025.

Markdown

[Wang et al. "VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-vaflow/)

BibTeX

@inproceedings{wang2025iccv-vaflow,
  title     = {{VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching}},
  author    = {Wang, Xihua and Cheng, Xin and Wang, Yuyue and Song, Ruihua and Wang, Yunfeng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {11777--11786},
  url       = {https://mlanthology.org/iccv/2025/wang2025iccv-vaflow/}
}