Buy 4 REINFORCE Samples, Get a Baseline for Free!

Abstract

REINFORCE can be used to train models in structured prediction settings to directly optimize the test-time objective. However, the common case of sampling one prediction per datapoint (input) is data-inefficient. We show that by drawing multiple samples (predictions) per datapoint, we can learn with significantly less data, because the extra samples give us a REINFORCE baseline for free, which reduces variance. Additionally, we derive a REINFORCE estimator with a baseline, based on sampling without replacement. Combined with a recent technique to sample sequences without replacement using Stochastic Beam Search, this improves the training procedure for a sequence model that predicts the solution to the Travelling Salesman Problem.
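The "free" baseline from multiple samples can be illustrated with a minimal sketch, assuming a single categorical policy parameterized by logits and a known reward function. Each of the k samples uses the mean reward of the other k - 1 samples as its baseline; since that baseline does not depend on the sample it is paired with, the estimator stays unbiased. The function names (`reinforce_loo`, `loo_advantages`, `exact_grad`) and the toy setup are hypothetical illustrations, not the authors' code, and the without-replacement estimator is not shown.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits: Sequence[float], y: int) -> List[float]:
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - p."""
    p = softmax(logits)
    return [(1.0 if j == y else 0.0) - p[j] for j in range(len(logits))]


def loo_advantages(rewards: Sequence[float]) -> List[float]:
    """Per-sample weights (R_i - b_i) / k, where b_i is the mean reward of the
    other k - 1 samples. Algebraically equal to (R_i - mean(R)) / (k - 1)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples for a leave-one-out baseline")
    total = sum(rewards)
    return [(r - (total - r) / (k - 1)) / k for r in rewards]


def reinforce_loo(logits: Sequence[float], reward: Callable[[int], float],
                  k: int, rng: random.Random) -> List[float]:
    """One REINFORCE gradient estimate (w.r.t. the logits) from k i.i.d.
    samples, each using the other samples' mean reward as its baseline."""
    p = softmax(logits)
    ys = rng.choices(range(len(logits)), weights=p, k=k)
    advs = loo_advantages([reward(y) for y in ys])
    grad = [0.0] * len(logits)
    for y, a in zip(ys, advs):
        g = grad_log_prob(logits, y)
        for j in range(len(logits)):
            grad[j] += a * g[j]
    return grad


def exact_grad(logits: Sequence[float], reward: Callable[[int], float]) -> List[float]:
    """Exact gradient of E_{y ~ p}[reward(y)] w.r.t. the logits, for reference."""
    p = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        g = grad_log_prob(logits, y)
        for j in range(n):
            grad[j] += p[y] * reward(y) * g[j]
    return grad
```

For example, `reinforce_loo([0.5, -0.2, 1.0], lambda y: [1.0, 0.0, 3.0][y], k=4, rng=random.Random(0))` returns one noisy estimate; averaging many such estimates should approach `exact_grad` for the same logits and reward. The leave-one-out form avoids a separately learned critic or a greedy-rollout baseline, at the cost of evaluating k samples per datapoint.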

Cite

Text

Kool et al. "Buy 4 REINFORCE Samples, Get a Baseline for Free!" ICLR 2019 Workshops: drlStructPred, 2019.


BibTeX

@inproceedings{kool2019iclrw-buy,
  title     = {{Buy 4 REINFORCE Samples, Get a Baseline for Free!}},
  author    = {Kool, Wouter and van Hoof, Herke and Welling, Max},
  booktitle = {ICLR 2019 Workshops: drlStructPred},
  year      = {2019},
  url       = {https://mlanthology.org/iclrw/2019/kool2019iclrw-buy/}
}