Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Abstract

Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based, training a reward model on preference pairs and optimizing with reinforcement learning (RL), or reward-free, fine-tuning directly on ranked outputs. Recent research shows that well-tuned reward-based pipelines remain the most robust and that single-response demonstrations can outperform pairwise preference data. However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations. We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling, which adjusts rewards by task difficulty: data-level hardness is measured by text-encoder cosine similarity, and model-level responsiveness by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
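
As a rough illustration of the dynamic reward scaling idea, the sketch below multiplies GRPO-style group-relative advantages by a difficulty factor. The abstract does not give the actual formulas, so everything specific here is an assumption: the function names, the additive form of the scale, the weights `alpha` and `beta`, the use of `1 - cosine similarity` as data-level hardness, and the sigmoid of the negative reward gap as model-level hardness. Which texts are embedded and compared is also left open. Only the group normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_scale(data_hardness, reward_gap, alpha=1.0, beta=1.0):
    """Hypothetical scaling factor; NOT the paper's formula.

    data_hardness: assumed to lie in [0, 2], larger means harder.
    reward_gap: assumed gap between rewards; a small gap is treated
    as a sign that the model finds the task hard.
    """
    model_hardness = 1.0 / (1.0 + math.exp(reward_gap))  # sigmoid(-gap)
    return 1.0 + alpha * data_hardness + beta * model_hardness


def scaled_advantages(rewards, emb_a, emb_b, reward_gap):
    """Scale group-relative advantages by the assumed difficulty factor."""
    # Assumption: low embedding similarity is read as high data-level hardness.
    data_hardness = 1.0 - cosine_similarity(emb_a, emb_b)
    scale = difficulty_scale(data_hardness, reward_gap)
    return [scale * a for a in group_relative_advantages(rewards)]
```

Under these assumptions, a prompt with low embedding similarity and a small reward gap receives a larger scale, so its sampled responses contribute larger-magnitude advantages to the policy update, while easy prompts stay close to the unscaled GRPO signal.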

Cite

Text

Cheng et al. "Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment." International Conference on Learning Representations, 2026.

Markdown

[Cheng et al. "Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/cheng2026iclr-inverse/)

BibTeX

@inproceedings{cheng2026iclr-inverse,
  title     = {{Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment}},
  author    = {Cheng, Ruoxi and Ma, Hao-Xuan and Wang, Weixin and Duan, Ranjie and Liu, Jiexi and Jia, Xiaoshuang and Qin, Simeng and Cao, Xiaochun and Liu, Yang and Jia, Xiaojun},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/cheng2026iclr-inverse/}
}