Universal Trojan Signatures in Reinforcement Learning
Abstract
We present a novel approach for characterizing Trojaned reinforcement learning (RL) agents. By monitoring for discrepancies in how an agent's policy evaluates state observations for choosing an action, we can reliably detect whether the policy is Trojaned. Experiments on the IARPA RL challenge benchmarks show that our approach can effectively detect Trojaned models even in transfer settings with novel RL environments and modified architectures.
Cite
Text
Acharya et al. "Universal Trojan Signatures in Reinforcement Learning." NeurIPS 2023 Workshops: BUGS, 2023.
Markdown
[Acharya et al. "Universal Trojan Signatures in Reinforcement Learning." NeurIPS 2023 Workshops: BUGS, 2023.](https://mlanthology.org/neuripsw/2023/acharya2023neuripsw-universal/)
BibTeX
@inproceedings{acharya2023neuripsw-universal,
title = {{Universal Trojan Signatures in Reinforcement Learning}},
author = {Acharya, Manoj and Zhou, Weichao and Roy, Anirban and Lin, Xiao and Li, Wenchao and Jha, Susmit},
booktitle = {NeurIPS 2023 Workshops: BUGS},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/acharya2023neuripsw-universal/}
}