Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

Abstract

The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize informative audio segments through supervised fine-tuning, and then incentivizing proficient revisiting via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically revisiting audio segments on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority on both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.

Cite

Text

Wu et al. "Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-echo/)

BibTeX

@inproceedings{wu2026iclr-echo,
  title     = {{Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning}},
  author    = {Wu, Daiqing and Zhang, Xuan and Yang, Dongbao and Yao, Jiashu and Chen, Longfei and Liu, Qingsong and Zhao, Sicheng and Ma, Can and Kang, Yangyang and Zhou, Yu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-echo/}
}