D2 Actor Critic: Diffusion Actor Meets Distributional Critic
Abstract
We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies effectively online. At its core is a policy improvement objective that avoids both the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design as a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including the Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL settings. Beyond standard benchmarks, we also evaluate our approach on a biologically motivated predator-prey task to examine its behavioral robustness and generalization capacity.
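The paper's exact critic construction is not reproduced on this page, but as a rough illustration of what fusing distributional RL with clipped double Q-learning can look like, below is a minimal PyTorch sketch. Names such as QuantileCritic and clipped_distributional_target are hypothetical, and the quantile parameterization is an assumption rather than the paper's stated design: two quantile critics produce return distributions, and the Bellman target takes the elementwise minimum over their quantile estimates.

import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    # Hypothetical critic: maps a (state, action) pair to n_quantiles
    # estimates of the return distribution, quantile-regression style.
    def __init__(self, state_dim, action_dim, n_quantiles=51, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),
        )

    def forward(self, state, action):
        # Returns quantile estimates of shape (batch, n_quantiles).
        return self.net(torch.cat([state, action], dim=-1))

def clipped_distributional_target(target_critic1, target_critic2,
                                  reward, done, next_state, next_action,
                                  gamma=0.99):
    # Clipped double Q-learning lifted to the distributional setting:
    # take the elementwise minimum over the two target critics' quantiles
    # (a pessimistic return distribution) before the Bellman backup.
    # reward and done have shape (batch, 1) and broadcast over quantiles.
    with torch.no_grad():
        z1 = target_critic1(next_state, next_action)
        z2 = target_critic2(next_state, next_action)
        z_min = torch.min(z1, z2)
        return reward + gamma * (1.0 - done) * z_min  # (batch, n_quantiles)

Taking the minimum per quantile, rather than over scalar Q-values, keeps the pessimism of clipped double Q-learning while preserving a full return distribution for the critic loss; this is one common way such a fusion is realized, not necessarily the paper's.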
Cite
Text
Zhang et al. "D2 Actor Critic: Diffusion Actor Meets Distributional Critic." Transactions on Machine Learning Research, 2025.

Markdown
[Zhang et al. "D2 Actor Critic: Diffusion Actor Meets Distributional Critic." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhang2025tmlr-d2/)

BibTeX
@article{zhang2025tmlr-d2,
  title   = {{D2 Actor Critic: Diffusion Actor Meets Distributional Critic}},
  author  = {Zhang, Lunjun and Han, Shuo and Lyu, Hanrui and Stadie, Bradly C.},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/zhang2025tmlr-d2/}
}