AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive evaluations demonstrate that AV-Link achieves substantial improvements in audio-video synchronization, outperforming more expensive baselines such as the MovieGen V2A model.
Cite
Text
Haji-Ali et al. "AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation." International Conference on Computer Vision, 2025.

Markdown
[Haji-Ali et al. "AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/hajiali2025iccv-avlink/)

BibTeX
@inproceedings{hajiali2025iccv-avlink,
title = {{AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation}},
author = {Haji-Ali, Moayed and Menapace, Willi and Siarohin, Aliaksandr and Skorokhodov, Ivan and Canberk, Alper and Lee, Kwot Sin and Ordonez, Vicente and Tulyakov, Sergey},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {19373--19385},
url = {https://mlanthology.org/iccv/2025/hajiali2025iccv-avlink/}
}