Wasserstein Policy Optimization

Abstract

We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions – without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
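To make the two properties named above concrete, here is a minimal, illustrative JAX sketch of what an update with both properties might look like: it uses the action-gradient of the critic, like deterministic policy gradients, while acting on a sampled stochastic policy without the reparameterization trick. It assumes the update contracts the action-gradient of the critic with the parameter gradient of the action-gradient of the log-policy; the exact WPO update, its sign and scaling conventions, and any additional terms are as given in the paper, not here. The names log_pi, q_value and wpo_style_update, the diagonal-Gaussian parameterization, and the toy critic are hypothetical and not the authors' code.

import jax
import jax.numpy as jnp

def log_pi(theta, s, a):
    # Hypothetical diagonal-Gaussian policy: linear state-dependent mean,
    # state-independent log standard deviation.
    mean = theta["W"] @ s + theta["b"]
    z = (a - mean) / jnp.exp(theta["log_std"])
    return jnp.sum(-0.5 * z ** 2 - theta["log_std"] - 0.5 * jnp.log(2.0 * jnp.pi))

def q_value(s, a):
    # Placeholder critic Q(s, a); in practice this would be a learned network.
    return -jnp.sum((a - jnp.tanh(s[: a.shape[0]])) ** 2)

def wpo_style_update(theta, s, a):
    # Action-gradient of the critic, as in deterministic policy gradients.
    # It does not depend on theta, so it is held fixed while differentiating
    # the policy.
    dq_da = jax.lax.stop_gradient(jax.grad(q_value, argnums=1)(s, a))

    def inner(theta):
        # Action-gradient of the log-policy, contracted with dQ/da
        # to form a scalar.
        dlogpi_da = jax.grad(log_pi, argnums=2)(theta, s, a)
        return jnp.dot(dlogpi_da, dq_da)

    # Differentiating that scalar w.r.t. the policy parameters gives a
    # single-sample update direction; in practice it would be averaged over
    # states and actions sampled from the current policy.
    return jax.grad(inner)(theta)

# Toy usage on a single state/action pair.
s = jax.random.normal(jax.random.PRNGKey(0), (4,))
a = jax.random.normal(jax.random.PRNGKey(1), (2,))
theta = {"W": jnp.zeros((2, 4)), "b": jnp.zeros(2), "log_std": jnp.zeros(2)}
update = wpo_style_update(theta, s, a)

Note that the action a is simply sampled from the current policy rather than reparameterized, and the critic's action-gradient enters only through a stop-gradient, mirroring the usual actor-critic separation between policy and value updates.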

Cite

Text

Pfau et al. "Wasserstein Policy Optimization." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Pfau et al. "Wasserstein Policy Optimization." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/pfau2025icml-wasserstein/)

BibTeX

@inproceedings{pfau2025icml-wasserstein,
  title     = {{Wasserstein Policy Optimization}},
  author    = {Pfau, David and Davies, Ian and Borsa, Diana L and Araújo, João Guilherme Madeira and Tracey, Brendan Daniel and Van Hasselt, Hado},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {49128--49149},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/pfau2025icml-wasserstein/}
}