SpOiLer: Offline Reinforcement Learning Using Scaled Penalties
Abstract
Offline Reinforcement Learning (RL) is a variant of off-policy learning where an optimal policy must be learned from a static dataset containing trajectories collected by an unknown behavior policy. In the offline setting, standard off-policy algorithms overestimate the values of out-of-distribution actions, and a policy trained naively in this way performs poorly in the environment due to distribution shift between the implied and real environment. This is especially likely when modelling complex and multi-modal data distributions. We propose Scaled-penalty Offline Learning (SpOiLer), an offline reinforcement learning algorithm that reduces the value of out-of-distribution actions relative to observed actions. The resultant pessimistic value function is a lower bound of the true value function and steers the policy towards selecting actions present in the dataset. Our method is a simple augmentation to the standard Bellman backup operator, and its implementation requires around 15 additional lines of code over soft actor-critic. We provide theoretical insights into how SpOiLer operates under the hood and show empirically that SpOiLer performs strongly against prior methods on a range of tasks.
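The abstract does not state the form of the penalty, so the following is only a minimal sketch of the general idea of a pessimistic Bellman backup, not the SpOiLer update itself. The function name `pessimistic_td_target`, the `ood_penalty` input, and the scale `alpha` are hypothetical names introduced for illustration. The bootstrapped next-state value is reduced by a scaled, non-negative penalty, so actions judged out-of-distribution receive a lower target than actions observed in the dataset.

```python
def pessimistic_td_target(reward, next_q, ood_penalty,
                          gamma=0.99, alpha=1.0, done=False):
    """Generic penalised one-step Bellman target.

    Illustrative assumption only: this is NOT the paper's exact rule.
    ood_penalty is a non-negative score for how far the next action
    lies from the dataset; alpha scales that penalty.
    """
    if ood_penalty < 0 or alpha < 0:
        raise ValueError("ood_penalty and alpha must be non-negative")
    if done:
        # No bootstrapping from terminal states.
        return reward
    # Standard backup would use gamma * next_q; the penalty lowers it.
    return reward + gamma * (next_q - alpha * ood_penalty)


# An action seen in the dataset (zero penalty) versus an unseen one.
in_dist = pessimistic_td_target(1.0, 10.0, ood_penalty=0.0, gamma=0.5)
out_dist = pessimistic_td_target(1.0, 10.0, ood_penalty=2.0,
                                 gamma=0.5, alpha=2.0)
print(in_dist, out_dist)
```

With a zero penalty the target reduces to the standard backup, which is why a change of this kind can be a small addition to an existing actor-critic implementation.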
Cite
Text
Srinivasan and Knottenbelt. "SpOiLer: Offline Reinforcement Learning Using Scaled Penalties." Proceedings of the 6th Annual Learning for Dynamics & Control Conference, 2024.
Markdown
[Srinivasan and Knottenbelt. "SpOiLer: Offline Reinforcement Learning Using Scaled Penalties." Proceedings of the 6th Annual Learning for Dynamics & Control Conference, 2024.](https://mlanthology.org/l4dc/2024/srinivasan2024l4dc-spoiler/)
BibTeX
@inproceedings{srinivasan2024l4dc-spoiler,
title = {{SpOiLer: Offline Reinforcement Learning Using Scaled Penalties}},
author = {Srinivasan, Padmanaba and Knottenbelt, William J.},
booktitle = {Proceedings of the 6th Annual Learning for Dynamics \& Control Conference},
year = {2024},
pages = {825--838},
volume = {242},
url = {https://mlanthology.org/l4dc/2024/srinivasan2024l4dc-spoiler/}
}