Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models

Jiang, Yuhua; Huang, Jiawei; Yuan, Yufeng; Mao, Xin; YuYue; Zhao, Qianchuan; Yan, Lin

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, YuYue, Qianchuan Zhao, Lin Yan

ICLR 2026

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs’ sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.
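To make the phrase "interpolates between mean and maximum rewards" concrete, here is a minimal sketch of one standard objective with that property, the entropic (exponential-utility) risk measure. The abstract does not state which functional form RS-GRPO uses, so the choice of this measure, the function name `entropic_risk`, the parameter `beta`, and the sample rewards are all assumptions made for illustration.

```python
import math


def entropic_risk(rewards, beta):
    """Return (1/beta) * log(mean(exp(beta * r))) over a list of rewards.

    As beta -> 0 this tends to the mean reward; as beta -> +inf it tends
    to the maximum reward. Illustrative only: not taken from the paper.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if beta == 0:
        # Limiting case: the plain average.
        return sum(rewards) / len(rewards)
    # Shift by the largest exponent so exp() cannot overflow (log-sum-exp trick).
    shift = max(beta * r for r in rewards)
    total = sum(math.exp(beta * r - shift) for r in rewards)
    return (shift + math.log(total / len(rewards))) / beta


# Hypothetical binary rewards for four sampled answers to one hard prompt:
# only one of the four is verified as correct.
rewards = [0.0, 0.0, 0.0, 1.0]
for beta in (0.0, 1.0, 10.0, 100.0):
    print(beta, round(entropic_risk(rewards, beta), 4))
```

With a small `beta` the single success barely lifts the value above the mean of 0.25, while a large `beta` pushes it toward the maximum of 1.0. That is the sense in which a risk-seeking objective gives more weight to rare successes on hard prompts.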

Cite

Text

Jiang et al. "Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Jiang et al. "Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/jiang2026iclr-risksensitive/)

BibTeX

@inproceedings{jiang2026iclr-risksensitive,
  title     = {{Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models}},
  author    = {Jiang, Yuhua and Huang, Jiawei and Yuan, Yufeng and Mao, Xin and YuYue and Zhao, Qianchuan and Yan, Lin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/jiang2026iclr-risksensitive/}
}