Defense Against Reward Poisoning Attacks in Reinforcement Learning
Abstract
We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider cost-effective targeted attacks---these attacks minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of their worst-case utility w.r.t. the true, unpoisoned rewards, while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and when it is unknown. For this optimization framework, we first provide characterization results for generic attack cost functions. These results show that the functional form of the attack cost function and the agent's knowledge about it are critical for establishing lower bounds on the agent's performance, as well as for the computational tractability of the defense problem. We then focus on a cost function based on the $\ell_2$ norm, for which we show that the defense problem can be solved efficiently and yields defense policies whose expected returns under the true rewards are lower bounded by their expected returns under the poisoned rewards. Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach.
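As a rough sketch of the interaction described above (the notation here is illustrative and not taken from the paper): the attacker perturbs the true reward function $R$ into $\widehat{R}$ at minimal cost so that its target policy $\pi_\dagger$ becomes uniquely optimal with margin $\epsilon$, while the defender, who observes only $\widehat{R}$, picks a policy maximizing its worst-case return over all true rewards consistent with that observation.

% Schematic formulation assumed from the abstract; R, \widehat{R},
% \pi_\dagger, \epsilon, \rho^\pi, and \mathcal{R}(\widehat{R}) are
% illustrative notation, not the paper's own.
\begin{align*}
  % Attack: minimally perturb R (here under the \ell_2 cost) so that
  % \pi_\dagger is uniquely optimal with gap at least \epsilon.
  \widehat{R} \;\in\; \arg\min_{R'} \;& \lVert R' - R \rVert_2
    \quad \text{s.t.} \quad \rho^{\pi_\dagger}(R') \;\ge\; \rho^{\pi}(R') + \epsilon
    \;\; \forall \pi \ne \pi_\dagger, \\
  % Defense: observing only \widehat{R}, maximize the worst-case return
  % over the set \mathcal{R}(\widehat{R}) of true reward functions that
  % could have produced the observed poisoned rewards.
  \pi_{\mathrm{def}} \;\in\; \arg\max_{\pi} \; \min_{R \in \mathcal{R}(\widehat{R})} \;& \rho^{\pi}(R).
\end{align*}

In this notation, the guarantee stated in the abstract for the $\ell_2$ cost reads $\rho^{\pi_{\mathrm{def}}}(R) \ge \rho^{\pi_{\mathrm{def}}}(\widehat{R})$: the defense policy's expected return under the true rewards is lower bounded by its expected return under the poisoned rewards.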
Cite
Text
Banihashem et al. "Defense Against Reward Poisoning Attacks in Reinforcement Learning." Transactions on Machine Learning Research, 2023.
Markdown
[Banihashem et al. "Defense Against Reward Poisoning Attacks in Reinforcement Learning." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/banihashem2023tmlr-defense/)
BibTeX
@article{banihashem2023tmlr-defense,
title = {{Defense Against Reward Poisoning Attacks in Reinforcement Learning}},
author = {Banihashem, Kiarash and Singla, Adish and Radanovic, Goran},
journal = {Transactions on Machine Learning Research},
year = {2023},
url = {https://mlanthology.org/tmlr/2023/banihashem2023tmlr-defense/}
}