Towards Fast Safe Online Reinforcement Learning via Policy Finetuning
Abstract
The high costs and risks of extensive environmental interaction hinder the practical application of current online safe reinforcement learning (RL) methods. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to enable faster and safer online learning, a direction that has yet to be fully investigated. To fill this gap, we first show that naively applying existing O2O algorithms from standard RL does not work well in safe RL, due to two unique challenges: erroneous Q-estimations, resulting from the offline-online objective mismatch and offline cost sparsity, and Lagrangian mismatch, resulting from the difficulty of aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce Marvel, the first policy-finetuning-based framework for O2O safe RL, comprising two key components that work in concert: Value Pre-Alignment, which aligns the learned Q-functions with the online objective before finetuning, and Adaptive PID Control, which effectively adjusts the Lagrange multipliers during finetuning. Extensive experiments demonstrate the superior performance of Marvel over related baselines.
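To make the Lagrange-multiplier adjustment concrete, the sketch below shows a generic PID-controlled Lagrange multiplier update of the kind the abstract alludes to. The class name, gains (kp, ki, kd), and cost limit are illustrative assumptions and do not reproduce the paper's Adaptive PID Control; they follow the standard PID Lagrangian recipe.

```python
class PIDLagrangeMultiplier:
    """PID-style controller for a Lagrange multiplier in constrained RL.

    The multiplier grows when the observed episodic cost exceeds the cost
    limit and shrinks otherwise, with proportional, integral, and derivative
    terms smoothing the response. All gains and the cost limit below are
    illustrative placeholders, not values from the paper.
    """

    def __init__(self, kp=0.05, ki=0.0005, kd=0.1, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = 0.0
        self.lagrange = 0.0

    def update(self, episode_cost: float) -> float:
        # Constraint violation: positive when the policy is unsafe.
        error = episode_cost - self.cost_limit
        # Clip the integral at zero so long safe stretches cannot drive
        # the accumulated pressure (and hence the multiplier) negative.
        self.integral = max(self.integral + error, 0.0)
        # Derivative term reacts to an upward trend in the episodic cost.
        derivative = max(episode_cost - self.prev_cost, 0.0)
        self.prev_cost = episode_cost
        self.lagrange = max(
            self.kp * error + self.ki * self.integral + self.kd * derivative,
            0.0,
        )
        return self.lagrange
```

In a Lagrangian safe-RL loop, the returned multiplier would typically weight the cost critic in the policy objective, e.g. maximizing Q_reward minus lagrange times Q_cost.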
Cite
Text
Chen et al. "Towards Fast Safe Online Reinforcement Learning via Policy Finetuning." Transactions on Machine Learning Research, 2026.Markdown
[Chen et al. "Towards Fast Safe Online Reinforcement Learning via Policy Finetuning." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/chen2026tmlr-fast/)BibTeX
@article{chen2026tmlr-fast,
title = {{Towards Fast Safe Online Reinforcement Learning via Policy Finetuning}},
author = {Chen, Keru and Wei, Honghao and Deng, Zhigang and Lin, Sen},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/chen2026tmlr-fast/}
}