CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
Abstract
We introduce a multi-modal diffusion model tailored for bidirectional conditional generation of video and audio. We propose a joint contrastive training loss to improve synchronization between visual and auditory events. We evaluate the proposed model on two datasets, assessing generation quality and alignment performance from several angles with both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in quality and generation speed through the introduction of our novel cross-modal "easy fusion" architectural block. Furthermore, incorporating the contrastive loss improves audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
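The abstract does not spell out the form of the joint contrastive loss. As a rough illustration only, a common way to encourage audio-visual alignment is a symmetric InfoNCE-style objective over paired per-clip embeddings; the function below is a hypothetical sketch under that assumption, not the paper's actual loss (the embedding shapes and the `temperature` value are placeholders).

```python
import numpy as np

def contrastive_alignment_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; matched video/audio pairs share a row index.

    video_emb, audio_emb: (batch, dim) pooled per-clip embeddings (hypothetical shapes).
    """
    # L2-normalize so similarities are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch); diagonal holds matched pairs

    def xent_diag(l):
        # numerically stable cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video-to-audio and audio-to-video directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With such a loss, the model is rewarded when each video clip's embedding is closer to its own audio track than to the audio of other clips in the batch, which is one plausible mechanism for the alignment gains the abstract reports.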
Cite
Text
Yang et al. "CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-93806-1_16
Markdown
[Yang et al. "CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/yang2024eccvw-cmmd/) doi:10.1007/978-3-031-93806-1_16
BibTeX
@inproceedings{yang2024eccvw-cmmd,
title = {{CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling}},
author = {Yang, Ruihan and Gamper, Hannes and Braun, Sebastian},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {214-226},
doi = {10.1007/978-3-031-93806-1_16},
url = {https://mlanthology.org/eccvw/2024/yang2024eccvw-cmmd/}
}