Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Kang, Hyeongyu; Lee, Jaewoo; Shin, Woocheol; Om, Kiyoung; Park, Jinkyoo

ICLR 2026

Abstract

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer from reward over-optimization, which yields high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient to a training-free, differentiable estimate of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity. Our code is available at https://github.com/Shin-woocheol/SQDF.
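
As background on the "soft" value that the abstract refers to, the following is a minimal sketch of the standard soft value in KL-regularized RL, V = alpha * log E[exp(r / alpha)], where the expectation is taken over samples from the reference (pre-trained) model. It is not taken from the paper or its repository: the function name `soft_value`, the temperature `alpha`, and the toy rewards are illustrative assumptions, and SQDF's training-free, differentiable estimate of the soft Q-function is a different construction that is not reproduced here.

```python
import numpy as np

def soft_value(rewards, alpha):
    """Monte Carlo estimate of alpha * log mean(exp(r / alpha)).

    `rewards` are assumed to be rewards of samples drawn from the
    reference model; `alpha` is the (hypothetical) KL temperature.
    The max is subtracted before exponentiating for numerical stability.
    """
    r = np.asarray(rewards, dtype=float)
    z = r / alpha
    m = z.max()
    return alpha * (m + np.log(np.mean(np.exp(z - m))))

# Toy rewards (illustrative only, not experimental data).
rewards = np.array([0.0, 1.0, 2.0])

# Small alpha: weak KL regularization, the value approaches the max reward.
v_small = soft_value(rewards, alpha=0.01)

# Large alpha: strong KL regularization, the value approaches the mean reward.
v_large = soft_value(rewards, alpha=1000.0)
```

The two limits illustrate the reward-regularization trade-off in general terms: a small temperature lets the value chase the best sample, while a large temperature keeps it near the reference model's average reward.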

Cite

Text

Kang et al. "Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function." International Conference on Learning Representations, 2026.

Markdown

[Kang et al. "Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kang2026iclr-diffusion/)

BibTeX

@inproceedings{kang2026iclr-diffusion,
  title     = {{Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function}},
  author    = {Kang, Hyeongyu and Lee, Jaewoo and Shin, Woocheol and Om, Kiyoung and Park, Jinkyoo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kang2026iclr-diffusion/}
}