Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing
Abstract
Recently, visualization-of-thought (VoT) has unlocked new opportunities for complex spatial reasoning in multimodal large language models (MLLMs) by complementing verbal reasoning with visual thinking. However, the autoregressive accumulation of lengthy and redundant tokens substantially increases computation and memory costs. In this paper, we present a new efficient framework for multimodal spatial reasoning, named *DARE*, designed to adaptively prune multimodal tokens across different network depths, reasoning hops, and modalities. First, *DARE* devises an intra- and inter-hop-aware differentiable retention mechanism to dynamically estimate token importance both within each reasoning step and across successive hops. Recognizing that deeper network layers encode visual cues into verbal streams, *DARE* introduces an asymmetric compression strategy that prunes tokens according to modality-specific redundancy and semantic importance. Furthermore, *DARE* incorporates a progressive KV-cache retention policy aligned with cross-modal fusion dynamics, further reducing memory overhead during autoregressive reasoning. Our method delivers substantial reductions in computation and memory footprint, averaging a 40.37\% reduction in FLOPs and 46.07\% reduction in KV caches usage, while consistently preserving or even improving reasoning performance across seven multimodal spatial reasoning benchmarks, and further generalizing to broader multimodal reasoning tasks, establishing a scalable and robust recipe for efficient multimodal reasoning.
Cite
Text
Shen et al. "Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing." International Conference on Learning Representations, 2026.Markdown
[Shen et al. "Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/shen2026iclr-efficient/)BibTeX
@inproceedings{shen2026iclr-efficient,
title = {{Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing}},
author = {Shen, Yixian and Bi, Qi and Wang, Zihan and Yang, Zhiheng and Wang, Changshuo and Zhang, Zhi and Tiwari, Prayag and Pimentel, Andy D. and Pathania, Anuj},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/shen2026iclr-efficient/}
}