Meta-Q-Learning

Abstract

This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with the state of the art in meta-RL.
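The third idea — recycling meta-training replay data for a new task via propensity estimation — can be illustrated with a minimal sketch. A classifier is trained to distinguish transitions from the meta-training buffer from transitions gathered on the new task; its odds ratio gives an importance weight for each old transition, and the effective sample size indicates how much old data is worth reusing. This is an assumption-laden illustration of the general technique (a NumPy logistic classifier with made-up function names), not the authors' implementation:

```python
import numpy as np

def propensity_weights(old_feats, new_feats, lr=0.1, steps=500):
    """Estimate importance weights beta(x) ~ p_new(x) / p_old(x) with a
    logistic classifier (propensity estimation). Transitions from the
    meta-training buffer are labeled 0, those from the new task 1.
    Assumes the two sets are roughly balanced in size."""
    X = np.vstack([old_feats, new_feats])
    y = np.concatenate([np.zeros(len(old_feats)), np.ones(len(new_feats))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(label = 1 | x)
        g = p - y                               # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    p_old = 1.0 / (1.0 + np.exp(-(old_feats @ w + b)))
    # With balanced classes, the odds p/(1-p) estimate the density ratio.
    return p_old / (1.0 - p_old)

def effective_sample_size(beta):
    """Normalized ESS in (0, 1]: near 1 when old data resembles new-task
    data (weights nearly uniform), near 0 when only a few transitions matter."""
    beta = beta / beta.sum()
    return 1.0 / (len(beta) * np.sum(beta ** 2))
```

For example, weighting old transitions drawn near the new task's distribution yields a high ESS, signaling that off-policy updates can safely reuse the meta-training buffer:

```python
rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, size=(256, 4))   # meta-training buffer features
new = rng.normal(0.3, 1.0, size=(256, 4))   # new-task features, slight shift
beta = propensity_weights(old, new)
ess = effective_sample_size(beta)
```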

Cite

Text

Fakoor et al. "Meta-Q-Learning." International Conference on Learning Representations, 2020.

Markdown

[Fakoor et al. "Meta-Q-Learning." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/fakoor2020iclr-metaqlearning/)

BibTeX

@inproceedings{fakoor2020iclr-metaqlearning,
  title     = {{Meta-Q-Learning}},
  author    = {Fakoor, Rasool and Chaudhari, Pratik and Soatto, Stefano and Smola, Alexander J.},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/fakoor2020iclr-metaqlearning/}
}