Offline Reinforcement Learning via Tsallis Regularization
Abstract
Offline reinforcement learning (RL) focuses on learning a good policy from a fixed dataset. The dataset is generated by an unknown behavior policy through interactions with the environment and covers only a subset of the state-action space. Standard off-policy algorithms often perform poorly in this setting, suffering from erroneously optimistic value estimates incurred by out-of-distribution (OOD) actions not present in the dataset. This optimism cannot be corrected because no further interaction with the environment is possible. Imposing divergence regularization and in-sample constraints is among the most popular approaches to overcoming this issue, ensuring that the learned policy stays close to the behavior policy and thus minimizing the occurrence of OOD actions. This paper proposes Tsallis regularization for offline RL, which aligns the induced sparsemax policies with the in-sample constraint. Sparsemax interpolates between existing methods based on hard-max and softmax policies: only a subset of actions receives non-zero probability, compared to softmax (all actions) and hard-max (a single action). We leverage this property to model the behavior policy and show that, under several assumptions, the learned sparsemax policies can have a sparsity-conditional KL divergence to the behavior policy, making Tsallis regularization especially suitable for behavior cloning methods. We propose a novel actor-critic algorithm, Tsallis Advantage Weighted Actor-Critic (Tsallis AWAC), which generalizes AWAC, and analyze its performance on standard MuJoCo environments. Our code is available at \url{https://github.com/lingweizhu/tsallis_regularization}.
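To illustrate the sparsity property mentioned in the abstract, below is a minimal sketch of the standard sparsemax projection (Martins & Astudillo, 2016), which maps a score vector onto the probability simplex and can assign exactly zero probability to low-scoring actions. This is a generic illustration of sparsemax, not the paper's Tsallis AWAC implementation; the function name and example scores are our own.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector z onto the
    probability simplex. Unlike softmax, low-scoring entries can receive
    exactly zero probability, so only a subset of actions is supported."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # entries kept in the support
    k_z = k[support][-1]                     # support size
    tau = (cumsum[k_z - 1] - 1) / k_z        # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

# Hypothetical example: the lowest-scoring action gets exactly zero mass,
# whereas softmax would still assign it positive probability.
scores = np.array([2.0, 1.5, -1.0])
print(sparsemax(scores))   # [0.75 0.25 0.  ]
```

In this sketch the probabilities sum to one but concentrate on the two highest-scoring actions, which is the interpolation between softmax (all actions) and hard-max (a single action) described above.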
Cite
Text
Zhu et al. "Offline Reinforcement Learning via Tsallis Regularization." Transactions on Machine Learning Research, 2024.

Markdown

[Zhu et al. "Offline Reinforcement Learning via Tsallis Regularization." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/zhu2024tmlr-offline/)

BibTeX
@article{zhu2024tmlr-offline,
title = {{Offline Reinforcement Learning via Tsallis Regularization}},
author = {Zhu, Lingwei and Schlegel, Matthew Kyle and Wang, Han and White, Martha},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/zhu2024tmlr-offline/}
}