Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies

Abstract

Offline Reinforcement Learning (RL) allows policies to be trained on pre-collected datasets without requiring additional interactions with the environment. This bypasses real-time data acquisition, which can be impractical in real-world applications due to the safety risks inherent in the learning process. However, offline RL faces significant challenges, such as distributional shift and extrapolation errors, and the resulting policies might underperform the baseline policy. Safe policy improvement algorithms mitigate these issues, enabling the reliable deployment of RL approaches in real-world scenarios where historical data is available, by guaranteeing that any policy change will not result in worse performance than the baseline policy used to collect the training data. In this paper, we propose MCTS-SPIBB, an algorithm that leverages Monte Carlo Tree Search (MCTS) to scale safe policy improvement to large domains. We theoretically prove that the policy generated by MCTS-SPIBB converges to the optimal safely improved policy produced by Safe Policy Improvement with Baseline Bootstrapping (SPIBB) as the number of simulations increases. Additionally, we introduce SDP-SPIBB, a novel extension of SPIBB designed to address the scalability limitations of the standard algorithm via Scalable Dynamic Programming. Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments.
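The baseline-bootstrapping idea behind SPIBB can be illustrated with a minimal sketch: for state-action pairs with too few dataset visits, the improved policy copies the baseline's probabilities, and only the remaining probability mass is reallocated greedily among well-estimated actions. The function name, array shapes, and threshold naming below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spibb_policy_step(q, pi_b, counts, n_wedge):
    """One greedy SPIBB-style improvement step for a single state (sketch).

    q:       estimated action values, shape (A,)
    pi_b:    baseline policy probabilities, shape (A,)
    counts:  dataset visit counts N(s, a), shape (A,)
    n_wedge: count threshold below which the baseline probability is kept
    """
    uncertain = counts < n_wedge            # "bootstrapped" state-action pairs
    pi = np.where(uncertain, pi_b, 0.0)     # copy baseline mass on uncertain actions
    free_mass = 1.0 - pi.sum()              # mass that may safely be reallocated
    known = np.where(~uncertain)[0]
    if known.size > 0 and free_mass > 0:
        # place all reallocatable mass on the best well-estimated action
        pi[known[np.argmax(q[known])]] += free_mass
    return pi
```

For example, with counts `[10, 2, 10]` and threshold 5, the middle action keeps its baseline probability while the free mass goes to the better-estimated of the other two actions; if every action is uncertain, the baseline policy is returned unchanged, which is the source of the safety guarantee.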

Cite

Text

Bianchi et al. "Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies." Journal of Artificial Intelligence Research, 2025. doi:10.1613/JAIR.1.19649

Markdown

[Bianchi et al. "Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies." Journal of Artificial Intelligence Research, 2025.](https://mlanthology.org/jair/2025/bianchi2025jair-scaling/) doi:10.1613/JAIR.1.19649

BibTeX

@article{bianchi2025jair-scaling,
  title     = {{Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies}},
  author    = {Bianchi, Federico and Castellini, Alberto and Zorzi, Edoardo and Simão, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  journal   = {Journal of Artificial Intelligence Research},
  year      = {2025},
  doi       = {10.1613/JAIR.1.19649},
  volume    = {84},
  url       = {https://mlanthology.org/jair/2025/bianchi2025jair-scaling/}
}