PrismAudio: Decomposed Chain-of-Thought and Multi-Dimensional Rewards for Video-to-Audio Generation

Abstract

Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy. Existing methods, however, suffer from objective entanglement, conflating these competing goals in a single loss function, and they lack human preference alignment. We introduce **PrismAudio**, the first framework to integrate Reinforcement Learning (RL) into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with a targeted reward function. This CoT-reward correspondence enables **multidimensional RL optimization** that guides the model to jointly generate better reasoning across all four perspectives, addressing the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose **Fast-GRPO**, which employs hybrid ODE-SDE sampling to dramatically reduce training overhead compared with existing GRPO implementations. We also introduce **AudioCanvas**, a rigorous benchmark that is more balanced in distribution and covers more realistic, diverse, and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and the out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio.github.io.
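
As a rough illustration of how per-dimension rewards could feed a GRPO-style update, the sketch below combines four reward scores into one scalar per sample and normalizes it within a sampling group, which is the standard group-relative advantage used by GRPO. The abstract does not specify how PrismAudio aggregates its rewards, so the weighted sum, the equal default weights, and the names `combine_rewards` and `group_relative_advantages` are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev

# The four perceptual dimensions named in the abstract.
DIMENSIONS = ("semantic", "temporal", "aesthetic", "spatial")


def combine_rewards(rewards, weights=None):
    """Collapse per-dimension rewards into one scalar.

    ASSUMPTION: a weighted sum with equal default weights; the paper may
    aggregate its rewards differently.
    """
    if weights is None:
        weights = {d: 1.0 / len(DIMENSIONS) for d in DIMENSIONS}
    return sum(weights[d] * rewards[d] for d in DIMENSIONS)


def group_relative_advantages(group_rewards, weights=None, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps) within one group.

    `group_rewards` holds one dict of per-dimension rewards for each sample
    generated from the same video prompt.
    """
    scalars = [combine_rewards(r, weights) for r in group_rewards]
    mu = mean(scalars)
    sigma = pstdev(scalars)  # population std over the group
    return [(s - mu) / (sigma + eps) for s in scalars]


if __name__ == "__main__":
    # Hypothetical reward scores for three samples of the same prompt.
    group = [
        {"semantic": 0.9, "temporal": 0.7, "aesthetic": 0.8, "spatial": 0.6},
        {"semantic": 0.5, "temporal": 0.6, "aesthetic": 0.4, "spatial": 0.5},
        {"semantic": 0.7, "temporal": 0.9, "aesthetic": 0.6, "spatial": 0.8},
    ]
    print(group_relative_advantages(group))
```

Samples that score above the group mean receive positive advantages and those below receive negative ones, so no learned value function is needed. Keeping the per-dimension scores separate until the final aggregation step also makes it easy to inspect which dimension drives a given advantage.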

Cite

Text

Liu et al. "PrismAudio: Decomposed Chain-of-Thought and Multi-Dimensional Rewards for Video-to-Audio Generation." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "PrismAudio: Decomposed Chain-of-Thought and Multi-Dimensional Rewards for Video-to-Audio Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-prismaudio/)

BibTeX

@inproceedings{liu2026iclr-prismaudio,
  title     = {{PrismAudio: Decomposed Chain-of-Thought and Multi-Dimensional Rewards for Video-to-Audio Generation}},
  author    = {Liu, Huadai and Luo, Kaicheng and Wang, Wen and Chen, Qian and Sun, Peiwen and Huang, Rongjie and Li, Xiangang and Ye, Jieping and Xue, Wei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-prismaudio/}
}