Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback

Abstract

We consider reinforcement learning and bandit structured prediction problems with very sparse loss feedback: a single loss observed only at the end of an episode. We introduce a novel algorithm, RESIDUAL LOSS PREDICTION (RESLOPE), that solves such problems by automatically learning an internal representation of a denser reward function. RESLOPE operates as a reduction to contextual bandits, using its learned loss representation to solve the credit assignment problem and a contextual bandit oracle to trade off exploration and exploitation. RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state-of-the-art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.
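The toy sketch below illustrates, in plain Python/NumPy, the residual credit-assignment idea described in the abstract: a single end-of-episode loss is converted into per-step contextual-bandit costs by subtracting the learned loss predictor's estimates at all other steps. It is a minimal, assumption-laden illustration rather than the authors' RESLOPE implementation; `ToyChainEnv`, `EpsGreedyBandit`, and `run_episode` are hypothetical stand-ins for the environment, the contextual bandit oracle, and the episode loop.

```python
# Minimal sketch of residual credit assignment with a toy contextual bandit.
# NOT the authors' code: environment, oracle, and update rule are illustrative.
import numpy as np

class ToyChainEnv:
    """Toy episodic task: loss = number of mistakes, revealed only at the end."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.n_features = horizon
        self.good = np.random.randint(2, size=horizon)   # fixed hidden targets
    def reset(self):
        self.t, self.mistakes = 0, 0
        return self._obs()
    def _obs(self):
        x = np.zeros(self.n_features)
        x[self.t % self.n_features] = 1.0                # one-hot time index
        return x
    def step(self, action):
        self.mistakes += int(action != self.good[self.t])
        self.t += 1
        return self._obs()
    def episodic_loss(self):
        return float(self.mistakes)                      # the ONLY feedback

class EpsGreedyBandit:
    """Toy contextual bandit oracle: linear cost regression + epsilon-greedy."""
    def __init__(self, n_features, n_actions, eps=0.1, lr=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.eps, self.lr = eps, lr
    def act(self, x):
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.w))
        return int(np.argmin(self.w @ x))                # lowest predicted cost
    def predict(self, x, a):
        return float(self.w[a] @ x)                      # doubles as loss predictor
    def update(self, x, a, cost):
        self.w[a] += self.lr * (cost - self.predict(x, a)) * x  # SGD on squared error

def run_episode(env, bandit):
    xs, acts = [], []
    x = env.reset()
    for _ in range(env.horizon):
        a = bandit.act(x)
        xs.append(x); acts.append(a)
        x = env.step(a)
    L = env.episodic_loss()
    # Residual credit assignment: step t is charged the episodic loss minus the
    # predicted losses of all *other* steps.
    preds = np.array([bandit.predict(xs[t], acts[t]) for t in range(env.horizon)])
    for t in range(env.horizon):
        bandit.update(xs[t], acts[t], L - (preds.sum() - preds[t]))
    return L

if __name__ == "__main__":
    env = ToyChainEnv()
    bandit = EpsGreedyBandit(env.n_features, n_actions=2)
    losses = [run_episode(env, bandit) for _ in range(2000)]
    print("mean loss, first vs. last 100 episodes:",
          np.mean(losses[:100]), np.mean(losses[-100:]))
```

In this sketch the bandit's own cost regressor plays the role of the learned per-step loss representation; early on the residual cost is just the episodic loss, and as the predictions improve each step is charged something closer to its own contribution.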

Cite

Text

Daumé III et al. "Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback." International Conference on Learning Representations, 2018.

Markdown

[Daumé III et al. "Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback." International Conference on Learning Representations, 2018.](https://mlanthology.org/iclr/2018/iii2018iclr-residual/)

BibTeX

@inproceedings{iii2018iclr-residual,
  title     = {{Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback}},
  author    = {Daumé III, Hal and Langford, John and Sharaf, Amr},
  booktitle = {International Conference on Learning Representations},
  year      = {2018},
  url       = {https://mlanthology.org/iclr/2018/iii2018iclr-residual/}
}