When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective
Abstract
Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement in high-stakes scenarios such as clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE has focused on improving the value-estimation algorithms, in this work we emphasize the importance of the offline dataset and put forward a data-centric framework for evaluating OPE problems. We propose DataCOPE, a data-centric framework for evaluating OPE in the logged contextual bandit setting, which answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible; (2) identifies the sub-group in the dataset where OPE can be inaccurate; and (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in the logged contextual bandit setting using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies such as clinical guidelines. Finally, we apply DataCOPE to the task of reward modeling in Large Language Model alignment to demonstrate its scalability in real-world applications.
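For readers unfamiliar with the setting, the sketch below illustrates what "evaluating a target policy from a logged contextual bandit dataset" means, using the standard inverse propensity scoring (IPS) estimator on simulated data. This is only a minimal illustration of the OPE problem the abstract refers to, not the DataCOPE method; all policies, data, and names are hypothetical.

```python
import numpy as np

# Minimal OPE illustration: estimate the value of a target policy using only
# data logged by a different (behavior) policy, via inverse propensity scoring.
# This is NOT DataCOPE; everything here is an illustrative assumption.

rng = np.random.default_rng(0)
n, d, n_actions = 5000, 4, 3
contexts = rng.normal(size=(n, d))

def logging_policy(x):
    """Action probabilities under the behavior policy (uniform here)."""
    return np.full(n_actions, 1.0 / n_actions)

def target_policy(x):
    """Action probabilities under the hypothetical policy we want to evaluate."""
    logits = np.array([x[0], x[1], -x[0]])
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Simulate logging: the behavior policy picks actions and rewards are observed.
actions = np.array([rng.choice(n_actions, p=logging_policy(x)) for x in contexts])
reward_prob = lambda x, a: 1.0 / (1.0 + np.exp(-(x[a] + 0.5 * (a == 1))))
rewards = np.array([rng.binomial(1, reward_prob(x, a)) for x, a in zip(contexts, actions)])

# IPS estimate of the target policy's value from the logged data alone.
weights = np.array([target_policy(x)[a] / logging_policy(x)[a]
                    for x, a in zip(contexts, actions)])
print(f"IPS estimate of target policy value: {np.mean(weights * rewards):.3f}")
```

The quality of such an estimate depends heavily on how well the logged data covers the contexts and actions favored by the target policy, which is the data-centric question the paper studies.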
Cite
Text
Sun et al. "When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective." Data-centric Machine Learning Research, 2024.
Markdown
[Sun et al. "When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective." Data-centric Machine Learning Research, 2024.](https://mlanthology.org/dmlr/2024/sun2024dmlr-offpolicy/)
BibTeX
@article{sun2024dmlr-offpolicy,
title = {{When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective}},
author = {Sun, Hao and Chan, Alex James and Seedat, Nabeel and Hüyük, Alihan and van der Schaar, Mihaela},
journal = {Data-centric Machine Learning Research},
year = {2024},
pages = {1--36},
volume = {1},
url = {https://mlanthology.org/dmlr/2024/sun2024dmlr-offpolicy/}
}