Complementing a Policy with a Different Observation Space

Abstract

We consider the problem of improving upon a black-box policy that operates on a different observation space than the learner. Such problems arise when augmenting an existing hand-engineered system with a new machine learning model, or in shared autonomy and human-AI complementarity settings. We prove that naively following the policy gradient can decrease performance, because the gradient is incorrectly grounded in a different observation space. Then, if we have access to both sets of observations at training time, we derive a method for correctly estimating the policy gradient via an application of the backdoor criterion. If we do not, we prove that, under certain assumptions, a proxy correction suffices to correctly estimate a direction of improvement.
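As a rough illustration (using standard causal-inference notation rather than the paper's own), the backdoor criterion licenses estimating an interventional quantity by adjusting for a variable Z that blocks every backdoor path from the action X to the outcome Y:

\[
  p(y \mid \mathrm{do}(x)) \;=\; \sum_{z} p(z)\, p(y \mid x, z)
\]

Intuitively, the observations available only at training time can play the role of the adjustment set Z, which is presumably what allows the policy gradient to be estimated without the bias that the naive estimator incurs.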

Cite

Text

Swamy et al. "Complementing a Policy with a Different Observation Space." ICML 2023 Workshops: ILHF, 2023.

Markdown

[Swamy et al. "Complementing a Policy with a Different Observation Space." ICML 2023 Workshops: ILHF, 2023.](https://mlanthology.org/icmlw/2023/swamy2023icmlw-complementing/)

BibTeX

@inproceedings{swamy2023icmlw-complementing,
  title     = {{Complementing a Policy with a Different Observation Space}},
  author    = {Swamy, Gokul and Choudhury, Sanjiban and Bagnell, Drew and Wu, Steven},
  booktitle = {ICML 2023 Workshops: ILHF},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/swamy2023icmlw-complementing/}
}