VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture the fine-grained visual cues needed for realistic background music generation, we introduce a new temporal video encoder architecture that allows us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, which consists of 2.2M video-music samples and is orders of magnitude larger than any prior dataset used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
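The abstract names a joint autoregressive and contrastive objective but does not give its exact form. The sketch below is one plausible reading, not the authors' implementation: it assumes next-token cross-entropy over discrete music tokens, plus a symmetric InfoNCE term that treats the i-th video and i-th music clip in a batch as the positive pair. All function names, the `temperature` value, and the `weight` parameter are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def autoregressive_loss(step_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence of music tokens."""
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(step_logits, target_tokens)]
    return sum(losses) / len(losses)

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(video_embs, music_embs, temperature=0.1):
    """Symmetric InfoNCE (an assumption): pair i of each list is the positive."""
    n = len(video_embs)
    sims = [[cosine(v, m) / temperature for m in music_embs]
            for v in video_embs]
    # Video-to-music direction: softmax over each row.
    v2m = sum(-log_softmax(sims[i])[i] for i in range(n)) / n
    # Music-to-video direction: softmax over each column.
    m2v = sum(-log_softmax([sims[j][i] for j in range(n)])[i]
              for i in range(n)) / n
    return 0.5 * (v2m + m2v)

def joint_loss(step_logits, target_tokens, video_embs, music_embs, weight=1.0):
    """Generation loss plus a weighted alignment loss (weight is hypothetical)."""
    return (autoregressive_loss(step_logits, target_tokens)
            + weight * contrastive_loss(video_embs, music_embs))
```

Under these assumptions, the contrastive term is small when matching video and music embeddings point in the same direction and large when they are mismatched, so minimizing the sum pushes generated music toward the video's high-level content while still fitting the token sequence.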

Cite

Text

Lin et al. "VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Lin et al. "VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/lin2025wacv-vmas/)

BibTeX

@inproceedings{lin2025wacv-vmas,
  title     = {{VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos}},
  author    = {Lin, Yan-Bo and Tian, Yu and Yang, Linjie and Bertasius, Gedas and Wang, Heng},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {1155-1165},
  url       = {https://mlanthology.org/wacv/2025/lin2025wacv-vmas/}
}