TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

ICCV 2025 pp. 14228-14237

Abstract

This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. TARO builds on flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, and introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting the alignment strength according to the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a 53% relative reduction in Fréchet Distance (FD), a 29% relative reduction in Fréchet Audio Distance (FAD), and an Alignment Accuracy of 97.19%, highlighting its superior audio quality and synchronization precision.
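
The idea behind TRA, scaling an alignment term by the position in the noise schedule, can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not give the weighting function or the alignment objective, so the linear schedule in `alignment_weight`, the cosine-based distance, and the convention that `t = 0` is pure noise and `t = 1` is clean data are all assumptions chosen for illustration.

```python
import math


def alignment_weight(t: float) -> float:
    """Hypothetical schedule (assumption): weight grows linearly with t.

    Assumes t in [0, 1], with t = 0 meaning pure noise and t = 1 meaning
    clean data, so alignment is enforced more strongly on cleaner latents.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return t


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def weighted_alignment_loss(latent, target, t):
    """Alignment penalty (1 - cosine similarity) scaled by the timestep weight.

    `latent` stands in for a model's intermediate representation and
    `target` for a reference representation; both names are illustrative.
    """
    return alignment_weight(t) * (1.0 - cosine_similarity(latent, target))
```

Under these assumptions, orthogonal vectors at `t = 0.5` give a loss of `0.5`, and any pair of vectors at `t = 0` gives a loss of zero, so the alignment term contributes nothing when the input is pure noise.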

Cite

Text

Ton et al. "TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis." International Conference on Computer Vision, 2025.

Markdown

[Ton et al. "TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ton2025iccv-taro/)

BibTeX

@inproceedings{ton2025iccv-taro,
  title     = {{TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis}},
  author    = {Ton, Tri and Hong, Ji Woo and Yoo, Chang D.},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {14228--14237},
  url       = {https://mlanthology.org/iccv/2025/ton2025iccv-taro/}
}