Information Theoretic Guarantees for Policy Alignment in Large Language Models

Abstract

Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy as measured by an $f$-divergence such as the $\mathsf{KL}$ divergence. The best of $n$ alignment policy selects, among $n$ independent samples from the reference policy, the sample with the highest reward. Recent work shows that the reward improvement of the aligned policy over the reference scales as $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for best of $n$ policies. We show that this $\sqrt{\mathsf{KL}}$ bound holds whenever the reward has sub-Gaussian tails under the reference policy. For best of $n$ policies, the $\mathsf{KL}$ bound extends to any $f$-divergence through a reduction to exponential order statistics via the Rényi representation. When additional information on the tails of the aligned policy is available, tighter control on the reward improvement can be obtained via the Rényi divergence. Finally, we demonstrate how these bounds transfer from proxy rewards to golden rewards, resulting in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.
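
As a rough illustration of the best of $n$ construction and the explicit $\mathsf{KL}$ bound mentioned above, the sketch below draws $n$ samples from a toy reference policy and keeps the one with the highest reward. The sampler, the identity reward, and the Gaussian toy distribution are hypothetical stand-ins chosen only for the example; $\log n - (n-1)/n$ is the explicit upper bound on $\mathsf{KL}(\pi_{\text{best-of-}n} \,\|\, \pi_{\text{ref}})$ referenced in the abstract.

```python
import math
import random


def best_of_n_sample(sample_from_reference, reward_fn, n):
    """Draw n i.i.d. samples from the reference policy and keep the highest-reward one."""
    candidates = [sample_from_reference() for _ in range(n)]
    return max(candidates, key=reward_fn)


def best_of_n_kl_upper_bound(n):
    """Explicit upper bound on KL(pi_{best-of-n} || pi_ref): log n - (n - 1)/n."""
    return math.log(n) - (n - 1) / n


if __name__ == "__main__":
    # Toy reference policy: samples are standard normal draws (hypothetical stand-in).
    sample_from_reference = lambda: random.gauss(0.0, 1.0)
    reward_fn = lambda x: x  # identity reward, again only for the toy example
    n = 16
    best = best_of_n_sample(sample_from_reference, reward_fn, n)
    print(f"best-of-{n} sample reward: {best:.3f}")
    print(f"KL upper bound: {best_of_n_kl_upper_bound(n):.3f} nats")
```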

Cite

Text

Mroueh and Nitsure. "Information Theoretic Guarantees for Policy Alignment in Large Language Models." Transactions on Machine Learning Research, 2025.

Markdown

[Mroueh and Nitsure. "Information Theoretic Guarantees for Policy Alignment in Large Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/mroueh2025tmlr-information/)

BibTeX

@article{mroueh2025tmlr-information,
  title     = {{Information Theoretic Guarantees for Policy Alignment in Large Language Models}},
  author    = {Mroueh, Youssef and Nitsure, Apoorva},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/mroueh2025tmlr-information/}
}