CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Abstract

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of the proposed model, assessing generation quality and alignment performance from various angles with both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in quality and generation speed through the introduction of our novel cross-modal easy fusion architectural block. Furthermore, incorporating the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task.

Cite

Text

Yang et al. "CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-93806-1_16

Markdown

[Yang et al. "CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/yang2024eccvw-cmmd/) doi:10.1007/978-3-031-93806-1_16

BibTeX

@inproceedings{yang2024eccvw-cmmd,
  title     = {{CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling}},
  author    = {Yang, Ruihan and Gamper, Hannes and Braun, Sebastian},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {214--226},
  doi       = {10.1007/978-3-031-93806-1_16},
  url       = {https://mlanthology.org/eccvw/2024/yang2024eccvw-cmmd/}
}