Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Abstract

Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents a way to balance the completeness and complexity of mel-spectrogram representations through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals, representing each in either quantized or continuous form so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics covering dimensions such as quality, synchronization, and semantic consistency.

Cite

Text

Wang et al. "Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00296

Markdown

[Wang et al. "Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-synchronized/) doi:10.1109/CVPR52734.2025.00296

BibTeX

@inproceedings{wang2025cvpr-synchronized,
  title     = {{Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition}},
  author    = {Wang, Juncheng and Xu, Chao and Yu, Cheng and Shang, Lei and Hu, Zhe and Wang, Shujun and Bo, Liefeng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3111-3120},
  doi       = {10.1109/CVPR52734.2025.00296},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-synchronized/}
}