DIVA-GRPO: Enhancing Multimodal Reasoning Through Difficulty-Adaptive Variant Advantage

Gao, Haowen; Zhang, Zhenyu; Pang, Liang; Guo, Fangda; Douhongjian; Lv, Guannan; Liu, ShaoGuo; Gao, Tingting; Shen, Huawei; Cheng, Xueqi

ICLR 2026

Abstract

Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a traditional critic model, it often suffers from sparse rewards, arising from the scarcity of positive feedback on difficult problems, and from advantage vanishing, which occurs when group-level rewards exhibit high consistency for problems that are too easy or too hard. Existing solutions fall into three categories: sample enhancement and expansion, which may aggravate advantage vanishing due to poor control of the difficulty distribution; selective sample utilization, which fails to fully leverage the value of all data; and indirect reward design, which may introduce biased optimization directions due to misalignment between reasoning and the final outcome. However, these approaches overlook a fundamental question: for a given problem, how can we ensure that the within-group reward distribution of responses exhibits enough variance to yield clear optimization signals for each response? To address these issues, we propose DIVA-GRPO, a difficulty-adaptive variant augmentation advantage method that dynamically adjusts the difficulty distribution of variants for each problem from a global perspective. Our method dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and computes advantages within both local and global (a problem and its variants) groups using difficulty-weighted and normalized scaling. This design alleviates reward sparsity and advantage vanishing, minimizes data waste, and improves training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in both training efficiency and reasoning performance.
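
The advantage vanishing described in the abstract can be illustrated with a minimal sketch of the standard GRPO group-relative normalization, in which each response's reward is centered on the group mean and divided by the group standard deviation. This is not the DIVA-GRPO method itself; the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the binary rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (reward - group mean) / (group std + eps).

    Hypothetical helper for illustration; eps and the population
    standard deviation are assumptions, not details from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A problem that is too hard (all responses wrong) or too easy (all
# responses right) has zero reward variance, so every advantage is zero
# and the group contributes no optimization signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

Under these assumptions, only the mixed group carries a learning signal, which is why the abstract frames the goal as keeping enough within-group reward variance for each problem.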

Cite

Text

Gao et al. "DIVA-GRPO: Enhancing Multimodal Reasoning Through Difficulty-Adaptive Variant Advantage." International Conference on Learning Representations, 2026.

Markdown

[Gao et al. "DIVA-GRPO: Enhancing Multimodal Reasoning Through Difficulty-Adaptive Variant Advantage." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gao2026iclr-divagrpo/)

BibTeX

@inproceedings{gao2026iclr-divagrpo,
  title     = {{DIVA-GRPO: Enhancing Multimodal Reasoning Through Difficulty-Adaptive Variant Advantage}},
  author    = {Gao, Haowen and Zhang, Zhenyu and Pang, Liang and Guo, Fangda and Douhongjian and Lv, Guannan and Liu, ShaoGuo and Gao, Tingting and Shen, Huawei and Cheng, Xueqi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/gao2026iclr-divagrpo/}
}