Importance Weighting for Aligning Language Models Under Deployment Distribution Shift
Abstract
Aligning language models (LMs) with human preferences remains challenging partly because popular approaches, such as reinforcement learning from human feedback and direct preference optimization (DPO), often assume that the training data is sufficiently representative of the environment in which the model will be deployed. However, real-world applications frequently involve distribution shifts, e.g., changes in end-user behavior or preferences during usage or deployment, which pose a significant challenge to LM alignment approaches. In this paper, we propose an importance weighting method tailored for DPO, namely IW-DPO, to address distribution shifts in LM alignment. IW-DPO can be applied to joint distribution shifts in the prompts, responses, and preference labels without explicitly assuming the type of distribution shift. Our experimental results on various distribution shift scenarios demonstrate the usefulness of IW-DPO.
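To give a concrete picture of what importance weighting combined with DPO can look like, the sketch below applies per-example weights to the standard DPO objective. This is an illustrative interpretation only, not the paper's implementation: the function name, signature, and the way the weights are obtained are assumptions.

```python
import torch
import torch.nn.functional as F

def importance_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps, ref_rejected_logps,
                                 importance_weights, beta=0.1):
    """Illustrative importance-weighted DPO loss (not the paper's code).

    Each *_logps tensor holds per-example summed log-probabilities of the
    chosen / rejected response under the trainable policy or the frozen
    reference model. `importance_weights` are per-example estimates of the
    deployment-to-training density ratio; how they are estimated is
    method-specific and assumed here.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard per-example DPO loss: -log sigmoid(beta * reward margin)
    per_example = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    # Reweight each example by its estimated importance weight before averaging
    return (importance_weights * per_example).mean()
```

In such a setup, the weights might come, for instance, from a density-ratio estimator or a domain classifier trained to distinguish training-time from deployment-time data; the paper should be consulted for how IW-DPO actually estimates them.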
Cite
Text
Lodkaew et al. "Importance Weighting for Aligning Language Models Under Deployment Distribution Shift." Transactions on Machine Learning Research, 2025.
Markdown
[Lodkaew et al. "Importance Weighting for Aligning Language Models Under Deployment Distribution Shift." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/lodkaew2025tmlr-importance/)
BibTeX
@article{lodkaew2025tmlr-importance,
  title   = {{Importance Weighting for Aligning Language Models Under Deployment Distribution Shift}},
  author  = {Lodkaew, Thanawat and Fang, Tongtong and Ishida, Takashi and Sugiyama, Masashi},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/lodkaew2025tmlr-importance/}
}