Distributional Successor Features Enable Zero-Shot Policy Optimization

Abstract

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, policy optimization with successor features can be challenging. This work proposes a novel class of models, i.e., Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
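The abstract's remark that successor features reduce policy evaluation under new rewards to linear regression can be illustrated with a small tabular sketch. This shows the standard successor-feature identity only, not the DiSPO method or its diffusion-based instantiation. The transition matrix `P`, feature matrix `Phi`, weights `w_true`, and discount `gamma` are all hypothetical values chosen for illustration, and the reward is assumed to be exactly linear in the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 5, 3, 0.9

# Hypothetical Markov chain induced by a fixed policy (rows sum to 1).
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Hypothetical state features phi(s), one row per state.
Phi = rng.random((n_states, n_features))

# Successor features: psi(s) = E[sum_t gamma^t phi(s_t) | s_0 = s].
# In the tabular case they solve psi = Phi + gamma * P @ psi.
Psi = np.linalg.solve(np.eye(n_states) - gamma * P, Phi)

# A new reward assumed to be linear in the features: r(s) = phi(s) @ w_true.
w_true = np.array([1.0, -2.0, 0.5])
r = Phi @ w_true

# Recover the reward weights by least squares on (phi(s), r(s)) pairs.
w_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)

# Policy evaluation under the new reward is then a single matrix product.
V_sf = Psi @ w_hat

# Reference: solve the Bellman equation directly for the same reward.
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P, r)
```

Under these assumptions `V_sf` and `V_direct` agree up to numerical precision. Note that `Psi` is computed once for the policy and reused for any reward expressible in the same features; only the regression for `w_hat` depends on the new reward.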

Cite

Text

Zhu et al. "Distributional Successor Features Enable Zero-Shot Policy Optimization." Neural Information Processing Systems, 2024. doi:10.52202/079017-3959

Markdown

[Zhu et al. "Distributional Successor Features Enable Zero-Shot Policy Optimization." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhu2024neurips-distributional/) doi:10.52202/079017-3959

BibTeX

@inproceedings{zhu2024neurips-distributional,
  title     = {{Distributional Successor Features Enable Zero-Shot Policy Optimization}},
  author    = {Zhu, Chuning and Wang, Xinqi and Han, Tyler and Du, Simon Shaolei and Gupta, Abhishek},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3959},
  url       = {https://mlanthology.org/neurips/2024/zhu2024neurips-distributional/}
}