AURELIA: Test-Time Reasoning Distillation in Audio-Visual LLMs
Abstract
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distils structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications.
Cite
Text
Chowdhury et al. "AURELIA: Test-Time Reasoning Distillation in Audio-Visual LLMs." International Conference on Computer Vision, 2025.
Markdown
[Chowdhury et al. "AURELIA: Test-Time Reasoning Distillation in Audio-Visual LLMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/chowdhury2025iccv-aurelia/)
BibTeX
@inproceedings{chowdhury2025iccv-aurelia,
title = {{AURELIA: Test-Time Reasoning Distillation in Audio-Visual LLMs}},
author = {Chowdhury, Sanjoy and Gani, Hanan and Anand, Nishit and Nag, Sayan and Gao, Ruohan and Elhoseiny, Mohamed and Khan, Salman and Manocha, Dinesh},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {22899--22910},
url = {https://mlanthology.org/iccv/2025/chowdhury2025iccv-aurelia/}
}