Learning General World Models in a Handful of Reward-Free Deployments

Abstract

Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task-agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information-theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by explicitly maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE and show that, in a tabular setting, it improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite. Code and videos are available at https://ycxuyingchen.github.io/cascade/
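
The intuition behind accounting for population diversity in a tabular setting can be illustrated with a small toy sketch. This is not the CASCADE objective from the paper: it assumes each candidate explorer is summarized by a fixed state-visitation distribution, and it swaps in a simple stand-in objective, the entropy of the population's mixed visitation distribution. The function names (`greedy_cascade`, `naive_top_k`) and the example distributions are hypothetical. The greedy variant picks explorers one at a time, each conditioned on those already chosen, while the naive variant ranks explorers independently.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mixture(dists):
    """Uniform mixture of several distributions over the same states."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

def greedy_cascade(candidates, k):
    """Pick k explorers one at a time; each pick maximizes the entropy of
    the mixture formed with the explorers already chosen (toy stand-in)."""
    chosen = []
    remaining = list(range(len(candidates)))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda j: entropy(
                mixture([candidates[i] for i in chosen] + [candidates[j]])
            ),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

def naive_top_k(candidates, k):
    """Rank explorers by their individual entropy, ignoring the population."""
    order = sorted(
        range(len(candidates)),
        key=lambda j: entropy(candidates[j]),
        reverse=True,
    )
    return order[:k]

# Hypothetical visitation distributions over four states.
candidates = [
    [0.5, 0.5, 0.0, 0.0],  # explorer 0
    [0.5, 0.5, 0.0, 0.0],  # explorer 1: duplicates explorer 0
    [0.0, 0.0, 0.9, 0.1],  # explorer 2: low entropy, but visits new states
]

greedy = greedy_cascade(candidates, 2)
naive = naive_top_k(candidates, 2)
print(greedy, entropy(mixture([candidates[i] for i in greedy])))
print(naive, entropy(mixture([candidates[i] for i in naive])))
```

In this toy example the naive ranking selects the two identical explorers, whereas the population-aware selection pairs explorer 0 with explorer 2 and covers all four states, giving a higher mixture entropy.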

Cite

Text

Xu et al. "Learning General World Models in a Handful of Reward-Free Deployments." Neural Information Processing Systems, 2022.

Markdown

[Xu et al. "Learning General World Models in a Handful of Reward-Free Deployments." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/xu2022neurips-learning-a/)

BibTeX

@inproceedings{xu2022neurips-learning-a,
  title     = {{Learning General World Models in a Handful of Reward-Free Deployments}},
  author    = {Xu, Yingchen and Parker-Holder, Jack and Pacchiano, Aldo and Ball, Philip and Rybkin, Oleh and Roberts, S and Rocktäschel, Tim and Grefenstette, Edward},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/xu2022neurips-learning-a/}
}