ViSAGe: Video-to-Spatial Audio Generation
Abstract
Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation followed by audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned, high-quality spatial audio that adapts to viewpoint changes.
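The abstract refers to evaluating spatial quality via audio energy maps computed from first-order ambisonics. The following is a minimal sketch of how such an energy map can be derived, assuming ACN channel ordering (W, Y, Z, X) with SN3D normalization and a simple first-order beamformer over a grid of look directions; the grid size, normalization, and beamformer choice here are illustrative assumptions, not the paper's exact metric definition.

import numpy as np

def foa_energy_map(foa, n_az=36, n_el=18):
    """foa: (4, T) array of FOA channels in ACN order W, Y, Z, X (SN3D).
    Returns an (n_el, n_az) map of average power per look direction,
    normalized to sum to 1."""
    W, Y, Z, X = foa
    az = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)
    az_g, el_g = np.meshgrid(az, el)                      # (n_el, n_az)
    # Unit look-direction vectors for each grid point.
    ux = np.cos(el_g) * np.cos(az_g)
    uy = np.cos(el_g) * np.sin(az_g)
    uz = np.sin(el_g)
    # Simple first-order (cardioid-like) beamformer per direction.
    sig = (W[None, None, :]
           + ux[..., None] * X[None, None, :]
           + uy[..., None] * Y[None, None, :]
           + uz[..., None] * Z[None, None, :])            # (n_el, n_az, T)
    energy = (sig ** 2).mean(axis=-1)
    return energy / (energy.sum() + 1e-12)

if __name__ == "__main__":
    # Encode a mono tone at azimuth +90 deg (left), elevation 0 into FOA,
    # then check that the energy map peaks on the left.
    t = np.linspace(0, 1, 16000)
    mono = np.sin(2 * np.pi * 440 * t)
    foa = np.stack([mono, mono, np.zeros_like(mono), np.zeros_like(mono)])  # W, Y, Z, X
    emap = foa_energy_map(foa)
    print(emap.shape, np.unravel_index(emap.argmax(), emap.shape))

A generated and a reference energy map of this kind can then be compared with saliency-style scores, which is the spirit of the spatial evaluation described above.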
Cite
Text
Kim et al. "ViSAGe: Video-to-Spatial Audio Generation." International Conference on Learning Representations, 2025.
Markdown
[Kim et al. "ViSAGe: Video-to-Spatial Audio Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/kim2025iclr-visage/)
BibTeX
@inproceedings{kim2025iclr-visage,
title = {{ViSAGe: Video-to-Spatial Audio Generation}},
author = {Kim, Jaeyeon and Yun, Heeseung and Kim, Gunhee},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/kim2025iclr-visage/}
}