Learn to Reason Efficiently with Adaptive Length-Based Reward Shaping

Liu, Wei; Zhou, Ruochen; Deng, Yiyun; Huang, Yuzhen; Liu, Junteng; Deng, Yuntian; Zhang, Yizhe; He, Junxian

ICLR 2026

Abstract

Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel **L**ength-b**A**sed **S**t**E**p **R**eward shaping method (LASER), which employs a step function as the reward based on target length. LASER surpasses previous methods, achieving a superior trade-off between performance and efficiency. We then extend LASER based on two key intuitions: (1) the reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic; (2) rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall trade-off. The resulting method is termed LASER-D (**D**ynamic and **D**ifficulty-aware). Experiments on five open-weight models from 1.5B to 32B demonstrate that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D achieves a **5.3** improvement on AIME2024 while reducing token usage by **64**%. Further analysis reveals that our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". All resources (Models, Code, Data) are available at https://github.com/hkust-nlp/Laser.
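
To make the idea of a step-shaped, difficulty-aware length reward concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function names, the bonus value, the target lengths, the use of a per-query pass rate as a difficulty proxy, and the choice to grant the bonus only to correct responses.

```python
def step_length_reward(is_correct: bool, num_tokens: int,
                       target_len: int, bonus: float = 0.5) -> float:
    """Correctness reward plus a step-shaped length bonus (hypothetical form)."""
    base = 1.0 if is_correct else 0.0
    # Step function: a correct response gets the full bonus if it fits
    # within the target length, and no bonus at all if it exceeds it.
    length_bonus = bonus if (is_correct and num_tokens <= target_len) else 0.0
    return base + length_bonus


def difficulty_aware_target(pass_rate: float, short_len: int = 1024,
                            long_len: int = 4096,
                            easy_threshold: float = 0.75) -> int:
    """Pick a smaller target length for queries that look easy.

    `pass_rate` (fraction of sampled responses that are correct) is an
    assumed difficulty proxy; the thresholds and lengths are placeholders.
    """
    return short_len if pass_rate >= easy_threshold else long_len


# Usage: an easy query gets a tight target, so a long correct answer
# earns only the correctness reward.
target = difficulty_aware_target(pass_rate=0.9)
reward = step_length_reward(is_correct=True, num_tokens=2000, target_len=target)
```

In this sketch, a dynamic variant would periodically recompute the target lengths during training rather than fixing them up front; how that is done in LASER-D is not specified in the abstract.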

Cite

Text

Liu et al. "Learn to Reason Efficiently with Adaptive Length-Based Reward Shaping." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "Learn to Reason Efficiently with Adaptive Length-Based Reward Shaping." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-learn/)

BibTeX

@inproceedings{liu2026iclr-learn,
  title     = {{Learn to Reason Efficiently with Adaptive Length-Based Reward Shaping}},
  author    = {Liu, Wei and Zhou, Ruochen and Deng, Yiyun and Huang, Yuzhen and Liu, Junteng and Deng, Yuntian and Zhang, Yizhe and He, Junxian},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-learn/}
}