PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Abstract
Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) on multimodal reasoning tasks. However, most existing multimodal RL approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored to interleaved multimodal tasks, together with a multi-stage strategy designed to improve the exploration-exploitation trade-off, thereby enhancing learning efficiency and task performance. Specifically, we introduce permutations of image sequences to simulate varied positional relationships and thus explore greater spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling that concentrates on the trajectories contributing most to learning optimal behaviors, so that learned policies are exploited effectively. We evaluate our model on 5 widely used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
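The two mechanisms named above can be illustrated with a minimal sketch. The function names, the permutation bookkeeping, and the mean-reward filtering criterion here are assumptions for illustration, not the paper's actual implementation: we shuffle the image order while recording the mapping back to original positions, and drop rollouts whose reward equals the group mean (i.e., zero advantage, contributing no policy-gradient signal in GRPO-style training).

```python
import random

def permute_image_sequence(images, rng=None):
    """Return a shuffled copy of the image list plus the permutation
    mapping new position -> original position, so that index references
    in the prompt can be remapped consistently (hypothetical helper)."""
    rng = rng or random.Random()
    order = list(range(len(images)))
    rng.shuffle(order)
    return [images[i] for i in order], order

def filter_rollouts(rollouts, rewards):
    """Keep only rollouts whose reward differs from the group mean,
    i.e., those with non-zero advantage that can drive a policy update
    (assumed filtering rule, shown for illustration)."""
    mean = sum(rewards) / len(rewards)
    return [r for r, rw in zip(rollouts, rewards) if rw != mean]
```

For example, a group of rollouts that all receive the same reward yields zero advantage everywhere and would be filtered out entirely, which is the degenerate case such resampling is meant to avoid spending updates on.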
Cite
Text
Zhang et al. "PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhang et al. "PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-perl/)
BibTeX
@inproceedings{zhang2025neurips-perl,
title = {{PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning}},
author = {Zhang, Yizhen and Ding, Yang and Zhang, Shuoshuo and Zhang, Xinchen and Li, Haoling and Li, Zhong-Zhi and Wang, Peijie and Wu, Jie and Ji, Lei and Gong, Yeyun and Shen, Yelong and Yang, Yujiu},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/zhang2025neurips-perl/}
}