Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment

Cai, Yuang; Yuan, Yuyu; Shi, Jinsheng; Lin, Qinhong

doi:10.1609/AAAI.V39I22.34519

AAAI 2025 pp. 23505-23513

Abstract

The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation facilitates intermediate reward modeling and direct reward modeling on each individual demonstration, which enhances the utilization of training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.
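
As a rough illustration of the pairwise setup the abstract criticizes, the sketch below uses the standard Bradley-Terry preference loss, which depends only on the reward difference between the chosen and rejected responses. It is not the AVA objective from the paper; the function names and the numbers are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


def sequence_reward(token_rewards: list[float]) -> float:
    """Sequence-level reward as the sum of (hypothetical) per-token rewards."""
    return sum(token_rewards)


# Shifting both rewards by the same constant leaves the loss unchanged,
# so the loss constrains only the difference, not each absolute reward.
loss_small = pairwise_preference_loss(1.0, 0.0)
loss_shifted = pairwise_preference_loss(101.0, 100.0)

# Two hypothetical per-token reward profiles with the same total:
# one pays out only at the final token, the other at every token.
end_only = [0.0, 0.0, 0.0, 2.0]
dense = [0.5, 0.5, 0.5, 0.5]

print(loss_small, loss_shifted)
print(sequence_reward(end_only), sequence_reward(dense))
```

The two printed losses coincide, and the two reward profiles have the same sequence-level total, so a signal defined only on the whole-sequence difference cannot tell them apart. This is one way to read the abstract's motivation for modeling intermediate rewards and the reward of each individual demonstration.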

Cite

Text

Cai et al. "Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34519

Markdown

[Cai et al. "Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/cai2025aaai-approximated/) doi:10.1609/AAAI.V39I22.34519

BibTeX

@inproceedings{cai2025aaai-approximated,
  title     = {{Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment}},
  author    = {Cai, Yuang and Yuan, Yuyu and Shi, Jinsheng and Lin, Qinhong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {23505--23513},
  doi       = {10.1609/AAAI.V39I22.34519},
  url       = {https://mlanthology.org/aaai/2025/cai2025aaai-approximated/}
}