PWM: Policy Learning with Multi-Task World Models

Abstract

Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without relying on costly online planning. Visualizations and code are available at [imgeorgiev.com/pwm](https://www.imgeorgiev.com/pwm/).
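The core recipe in the abstract is to treat a pre-trained world model as a smooth, differentiable surrogate of the environment and update the policy with first-order gradients of the predicted return. Below is a minimal sketch of that idea, not the authors' implementation: the network sizes, rollout horizon, discount, and synthetic start-state sampling are illustrative assumptions.

```python
# Minimal sketch: first-order policy optimization through a frozen,
# differentiable world model (all shapes/hyperparameters are assumptions).
import torch
import torch.nn as nn

obs_dim, act_dim, horizon = 16, 4, 16

# Stand-in for a pre-trained world model: maps (state, action) to
# (next state, reward). In PWM this would be trained on offline data.
world_model = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 128), nn.ELU(),
    nn.Linear(128, obs_dim + 1),
)
for p in world_model.parameters():
    p.requires_grad_(False)  # world model stays fixed during policy extraction

policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ELU(),
    nn.Linear(128, act_dim), nn.Tanh(),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for step in range(1000):
    s = torch.randn(256, obs_dim)  # batch of start states (placeholder sampling)
    ret = 0.0
    for t in range(horizon):
        a = policy(s)
        pred = world_model(torch.cat([s, a], dim=-1))
        s, r = pred[:, :obs_dim], pred[:, obs_dim:]
        ret = ret + (0.99 ** t) * r.mean()
    # First-order optimization: backpropagate the imagined return through
    # the smooth world model rather than the (possibly discontinuous) simulator.
    loss = -ret
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because gradients flow through the learned dynamics rather than the true simulator, a well-regularized world model can yield a smoother optimization landscape, which is the property the paper exploits.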

Cite

Text

Georgiev et al. "PWM: Policy Learning with Multi-Task World Models." International Conference on Learning Representations, 2025.

Markdown

[Georgiev et al. "PWM: Policy Learning with Multi-Task World Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/georgiev2025iclr-pwm/)

BibTeX

@inproceedings{georgiev2025iclr-pwm,
  title     = {{PWM: Policy Learning with Multi-Task World Models}},
  author    = {Georgiev, Ignat and Giridhar, Varun and Hansen, Nicklas and Garg, Animesh},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/georgiev2025iclr-pwm/}
}