Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Abstract

Video-to-Audio (V2A) synthesis has recently gained attention for its practical applications in generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods suffer from limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM conditioned on CAVP-aligned visual features over a spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via a cross-attention module. We further significantly improve sample quality with 'double guidance'. Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset. Furthermore, we demonstrate Diff-Foley's practical applicability and adaptability via customized downstream finetuning. Project Page: https://diff-foley.github.io/
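
The two stages named in the abstract can be sketched concretely. First, CAVP: a minimal sketch of the symmetric contrastive (InfoNCE) objective one would use to align paired video and audio clip embeddings. The function and variable names here are illustrative assumptions, not the paper's code, and the actual CAVP objective combines semantic (cross-video) and temporal (within-video) contrast rather than this single generic term.

import torch
import torch.nn.functional as F

def cavp_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    # video_emb, audio_emb: (B, D) pooled clip embeddings from the video and
    # audio encoders; matching (video, audio) pairs share a batch index, so
    # positives lie on the diagonal of the similarity matrix.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric InfoNCE: contrast video->audio and audio->video.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

Second, 'double guidance' at sampling time, which combines classifier-free guidance with the gradient of an alignment classifier. The sketch below follows the standard conventions (Ho & Salimans for classifier-free guidance; Dhariwal & Nichol for classifier guidance); the ldm and classifier call signatures, the class-index choice, the guidance weights, and the handling of the noise scale are all assumptions for illustration.

def double_guided_eps(ldm, classifier, x_t, t, cond, w_cfg=4.5, w_cls=0.5):
    # Classifier-free guidance: mix conditional and unconditional predictions.
    eps_cond = ldm(x_t, t, cond)        # conditioned on CAVP visual features
    eps_uncond = ldm(x_t, t, None)      # unconditional branch (dropped condition)
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)

    # Classifier guidance: gradient of log p(aligned | x_t, cond) w.r.t. the
    # noisy latent, assuming class index 1 means "audio-visually aligned".
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_p = classifier(x_in, t, cond).log_softmax(dim=-1)[:, 1].sum()
        grad = torch.autograd.grad(log_p, x_in)[0]
    # Subtracting the scaled gradient from eps adds it to the score; any
    # sigma_t scaling is folded into w_cls here for brevity.
    return eps - w_cls * grad

The returned noise estimate would replace the plain model output inside an otherwise standard DDPM/DDIM sampling loop.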

Cite

Text

Luo et al. "Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models." Neural Information Processing Systems, 2023.

Markdown

[Luo et al. "Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/luo2023neurips-difffoley/)

BibTeX

@inproceedings{luo2023neurips-difffoley,
  title     = {{Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models}},
  author    = {Luo, Simian and Yan, Chuanhao and Hu, Chenxu and Zhao, Hang},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/luo2023neurips-difffoley/}
}