Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. Yet current methods face an exploration dilemma: standard RL struggles to escape the local optima of pre-trained LLMs’ sharply peaked initial policies, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. We address this with a Risk-Sensitive Reinforcement Learning framework. By adopting a risk-seeking objective that interpolates between mean and maximum rewards, we derive a novel Risk-Sensitive GRPO (RS-GRPO) algorithm that emphasizes hard prompts to drive exploration. Across six mathematical reasoning benchmarks and five LLMs, RS-GRPO consistently improves pass@k performance while enhancing or maintaining pass@1.
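To make "interpolates between mean and maximum rewards" concrete, here is a minimal sketch of one standard objective with that property, the entropic (exponential-utility) risk measure (1/β) log E[exp(β r)]. The abstract does not specify the exact objective, so treating it as this form is an assumption, and the function name `entropic_risk` and the parameter `beta` are hypothetical names chosen for illustration.

```python
import math


def entropic_risk(rewards, beta):
    """Entropic risk of a sample of rewards: (1/beta) * log(mean(exp(beta * r))).

    Assumption: this is one common risk-seeking objective, not necessarily
    the exact one used by RS-GRPO. As beta -> 0 it approaches the mean
    reward; as beta -> +inf it approaches the maximum reward.
    """
    if beta == 0:
        # Limit case: the ordinary (risk-neutral) mean.
        return sum(rewards) / len(rewards)
    scaled = [beta * r for r in rewards]
    shift = max(scaled)  # subtract the max for a numerically stable log-mean-exp
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / beta


# A hard prompt: only one of four sampled responses is verified correct.
rewards = [0.0, 0.0, 0.0, 1.0]
near_mean = entropic_risk(rewards, 1e-6)    # close to the mean, 0.25
mid_value = entropic_risk(rewards, 4.0)     # strictly between mean and max
near_max = entropic_risk(rewards, 1000.0)   # close to the max, 1.0
```

Under this sketch, a prompt with a single rare success scores far higher for large `beta` than its mean reward suggests, which is one way a risk-seeking objective can place more weight on hard prompts.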

Cite

Text

Jiang et al. "Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Jiang et al. "Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/jiang2026iclr-risksensitive/)

BibTeX

@inproceedings{jiang2026iclr-risksensitive,
  title     = {{Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models}},
  author    = {Jiang, Yuhua and Huang, Jiawei and Yuan, Yufeng and Mao, Xin and YuYue and Zhao, Qianchuan and Yan, Lin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/jiang2026iclr-risksensitive/}
}