When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective

Abstract

Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value estimation, in this work, we emphasize the importance of the offline dataset, hence putting forward a data-centric framework for evaluating OPE problems. We propose DataCOPE, a data-centric framework for evaluating OPE in the logged contextual bandit setting, which answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible; (2) identifies the sub-groups in the dataset where OPE can be inaccurate; and (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies like clinical guidelines. Finally, we apply DataCOPE to the task of reward modeling in Large Language Model alignment to demonstrate its scalability in real-world applications.
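To make the logged contextual bandit OPE setting concrete, below is a minimal illustrative sketch (not the paper's DataCOPE implementation) of direct-method value estimation: a reward model is fit on logged (context, action, reward) tuples and then used to score the actions a target policy would take. All names here (the synthetic data, `target_policy`, and the reward model choice) are assumptions made purely for this example.

```python
# Illustrative sketch of direct-method OPE on a logged contextual bandit dataset.
# Hypothetical data and policy; not the method proposed in the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, d, n_actions = 2000, 5, 3

# Logged data: contexts x, actions a drawn by a behavior policy, observed rewards r.
x = rng.normal(size=(n, d))
a = rng.integers(0, n_actions, size=n)
r = (x[:, 0] * (a == 0) + x[:, 1] * (a == 1) + 0.5 * (a == 2)
     + 0.1 * rng.normal(size=n))

# Direct method: learn a reward model r_hat(x, a) from the logged data.
features = np.column_stack([x, np.eye(n_actions)[a]])
reward_model = GradientBoostingRegressor().fit(features, r)

# Target policy to evaluate: a simple deterministic rule, for illustration only.
def target_policy(context):
    return int(context[0] > 0)

# Estimated value of the target policy: mean predicted reward of its actions.
a_target = np.array([target_policy(xi) for xi in x])
feat_target = np.column_stack([x, np.eye(n_actions)[a_target]])
v_hat = reward_model.predict(feat_target).mean()
print(f"Direct-method value estimate for the target policy: {v_hat:.3f}")
```

Whether such an estimate is trustworthy depends heavily on how well the logged data covers the contexts and actions favored by the target policy, which is the data-centric question the paper studies.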

Cite

Text

Sun et al. "When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective." Data-centric Machine Learning Research, 2024.

Markdown

[Sun et al. "When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective." Data-centric Machine Learning Research, 2024.](https://mlanthology.org/dmlr/2024/sun2024dmlr-offpolicy/)

BibTeX

@article{sun2024dmlr-offpolicy,
  title     = {{When Is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective}},
  author    = {Sun, Hao and Chan, Alex James and Seedat, Nabeel and Hüyük, Alihan and van der Schaar, Mihaela},
  journal   = {Data-centric Machine Learning Research},
  year      = {2024},
  pages     = {1-36},
  volume    = {1},
  url       = {https://mlanthology.org/dmlr/2024/sun2024dmlr-offpolicy/}
}