Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback

Abstract

We consider reinforcement learning and bandit structured prediction problems with very sparse loss feedback: a single loss observed only at the end of an episode. We introduce a novel algorithm, RESIDUAL LOSS PREDICTION (RESLOPE), that solves such problems by automatically learning an internal representation of a denser reward function. RESLOPE operates as a reduction to contextual bandits, using its learned loss representation to solve the credit assignment problem and a contextual bandit oracle to trade off exploration and exploitation. RESLOPE enjoys a no-regret reduction-style theoretical guarantee and outperforms state-of-the-art reinforcement learning algorithms in both MDP environments and bandit structured prediction settings.
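The toy sketch below illustrates, in plain Python/NumPy, the residual credit-assignment idea described in the abstract: a single end-of-episode loss is converted into per-step contextual-bandit costs by subtracting the learned loss predictor's estimates at all other steps. It is a minimal, assumption-laden illustration rather than the authors' RESLOPE implementation; `ToyChainEnv`, `EpsGreedyBandit`, and `run_episode` are hypothetical stand-ins for the environment, the contextual bandit oracle, and the episode loop.

```python
# Minimal sketch of residual credit assignment with a toy contextual bandit.
# NOT the authors' code: environment, oracle, and update rule are illustrative.
import numpy as np

class ToyChainEnv:
    """Toy episodic task: loss = number of mistakes, revealed only at the end."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.n_features = horizon
        self.good = np.random.randint(2, size=horizon)   # fixed hidden targets
    def reset(self):
        self.t, self.mistakes = 0, 0
        return self._obs()
    def _obs(self):
        x = np.zeros(self.n_features)
        x[self.t % self.n_features] = 1.0                # one-hot time index
        return x
    def step(self, action):
        self.mistakes += int(action != self.good[self.t])
        self.t += 1
        return self._obs()
    def episodic_loss(self):
        return float(self.mistakes)                      # the ONLY feedback

class EpsGreedyBandit:
    """Toy contextual bandit oracle: linear cost regression + epsilon-greedy."""
    def __init__(self, n_features, n_actions, eps=0.1, lr=0.1):
        self.w = np.zeros((n_actions, n_features))
        self.eps, self.lr = eps, lr
    def act(self, x):
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.w))
        return int(np.argmin(self.w @ x))                # lowest predicted cost
    def predict(self, x, a):
        return float(self.w[a] @ x)                      # doubles as loss predictor
    def update(self, x, a, cost):
        self.w[a] += self.lr * (cost - self.predict(x, a)) * x  # SGD on squared error

def run_episode(env, bandit):
    xs, acts = [], []
    x = env.reset()
    for _ in range(env.horizon):
        a = bandit.act(x)
        xs.append(x); acts.append(a)
        x = env.step(a)
    L = env.episodic_loss()
    # Residual credit assignment: step t is charged the episodic loss minus the
    # predicted losses of all *other* steps.
    preds = np.array([bandit.predict(xs[t], acts[t]) for t in range(env.horizon)])
    for t in range(env.horizon):
        bandit.update(xs[t], acts[t], L - (preds.sum() - preds[t]))
    return L

if __name__ == "__main__":
    env = ToyChainEnv()
    bandit = EpsGreedyBandit(env.n_features, n_actions=2)
    losses = [run_episode(env, bandit) for _ in range(2000)]
    print("mean loss, first vs. last 100 episodes:",
          np.mean(losses[:100]), np.mean(losses[-100:]))
```

In this sketch the bandit's own cost regressor plays the role of the learned per-step loss representation; early on the residual cost is just the episodic loss, and as the predictions improve each step is charged something closer to its own contribution.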

Cite

Text

Daumé III et al. "Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback." International Conference on Learning Representations, 2018.

Markdown

[Daumé III et al. "Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback." International Conference on Learning Representations, 2018.](https://mlanthology.org/iclr/2018/iii2018iclr-residual/)

BibTeX

@inproceedings{iii2018iclr-residual,
  title     = {{Residual Loss Prediction: Reinforcement Learning with No Incremental Feedback}},
  author    = {Daumé III, Hal and Langford, John and Sharaf, Amr},
  booktitle = {International Conference on Learning Representations},
  year      = {2018},
  url       = {https://mlanthology.org/iclr/2018/iii2018iclr-residual/}
}