AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Abstract

Recent Diffusion Transformers (DiTs) have demonstrated impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, the potential of transformer-based diffusion models to efficiently denoise Gaussian noise into high-quality multimodal content remains underexplored. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with synchronized visual and audio tracks. To minimize model complexity and computational cost, AV-DiT uses a modality-shared DiT backbone pre-trained on image-only data, in which only lightweight, newly inserted adapters are trainable; this shared backbone drives the generation of both audio and video. Specifically, the video branch incorporates a trainable temporal attention layer into the frozen pre-trained DiT block to enforce temporal consistency, while a small number of trainable parameters adapt the image-based DiT block for audio generation. In addition, a shared self-attention block from the pre-trained DiT, equipped with lightweight trainable parameters, enables feature interaction between the audio and visual modalities, keeping them aligned. Extensive experiments on the AIST++ and Landscape datasets demonstrate that AV-DiT achieves state-of-the-art performance in joint audio-visual generation with significantly fewer tunable parameters.
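
For readers who want a concrete picture of the adapter scheme the abstract describes, below is a minimal PyTorch sketch of one AV-DiT-style layer: a frozen image-pretrained DiT block shared across modalities, a trainable temporal attention layer for the video branch, a small bottleneck adapter for the audio branch, and a shared attention block for audio-visual alignment. All names (`FrozenDiTBlock`, `AVAdapterBlock`), dimensions, bottleneck sizes, and the exact adapter placements are illustrative assumptions inferred from the abstract, not the authors' implementation.

```python
# Minimal sketch of the adapter scheme described in the abstract.
# Module names, shapes, and adapter placement are illustrative assumptions.
import torch
import torch.nn as nn


class FrozenDiTBlock(nn.Module):
    """Stand-in for one pre-trained, image-only DiT block (kept frozen)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        for p in self.parameters():  # the shared backbone is never trained
            p.requires_grad = False

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class AVAdapterBlock(nn.Module):
    """One AV-DiT-style layer: frozen spatial block + trainable adapters.

    Trainable pieces (hypothetical placement):
    - temporal attention over video frames, for temporal consistency
    - a bottleneck adapter that re-purposes the image block for audio tokens
    - a shared cross-modal attention that aligns audio and visual features
    """

    def __init__(self, dim: int, heads: int = 8, bottleneck: int = 64):
        super().__init__()
        self.spatial = FrozenDiTBlock(dim, heads)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_adapter = nn.Sequential(  # small trainable bottleneck MLP
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, audio):
        # video: (B, T, N, D) frame tokens; audio: (B, M, D) spectrogram tokens
        B, T, N, D = video.shape
        v = self.spatial(video.reshape(B * T, N, D)).reshape(B, T, N, D)

        # Trainable temporal attention across frames at each spatial location.
        vt = v.permute(0, 2, 1, 3).reshape(B * N, T, D)
        vt = vt + self.temporal_attn(vt, vt, vt, need_weights=False)[0]
        v = vt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Reuse the same frozen image block for audio, plus a small adapter.
        a = self.spatial(audio)
        a = a + self.audio_adapter(a)

        # Shared cross-modal attention for audio-visual alignment.
        v_flat = v.reshape(B, T * N, D)
        v_flat = v_flat + self.cross_attn(v_flat, a, a, need_weights=False)[0]
        a = a + self.cross_attn(a, v_flat, v_flat, need_weights=False)[0]
        return v_flat.reshape(B, T, N, D), a


# Example shapes: 2 clips, 4 frames of 16 patch tokens, 32 audio tokens, dim 128.
block = AVAdapterBlock(dim=128)
v, a = block(torch.randn(2, 4, 16, 128), torch.randn(2, 32, 128))
```

Because only `temporal_attn`, `audio_adapter`, and `cross_attn` carry gradients while the shared backbone stays frozen, the trainable parameter count remains small relative to the full model, mirroring the efficiency claim in the abstract.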

Cite

Text

Wang et al. "AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation." NeurIPS 2024 Workshops: Audio_Imagination, 2024.

Markdown

[Wang et al. "AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/wang2024neuripsw-avdit/)

BibTeX

@inproceedings{wang2024neuripsw-avdit,
  title     = {{AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation}},
  author    = {Wang, Kai and Deng, Shijian and Shi, Jing and Hatzinakos, Dimitrios and Tian, Yapeng},
  booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/wang2024neuripsw-avdit/}
}