Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models

Abstract

Recent advances in diffusion models for visual generation have sparked interest in human preference alignment, similar to developments in Large Language Models. While reward model (RM) based approaches enable trajectory-aware optimization by evaluating intermediate timesteps, they face two critical challenges: unreliable reward estimation on noisy latents, because pixel-level reward models are sensitive to noise, and inconsistent preference evaluation across sampling trajectories, because the preference ranking obtained from a single timestep can change depending on which timestep is selected. To address these limitations, we propose a framework with a targeted solution for each challenge. For reliable reward estimation on noisy latents, we introduce the Score-based Latent Reward Model (SLRM), which uses the complete diffusion model as a preference discriminator with learnable task tokens and a score enhancement mechanism that explicitly preserves noise compatibility by augmenting preference logits with the denoising score function. For consistent preference evaluation across trajectories, we develop Trajectory Advantages Preference Optimization (TAPO), which performs stochastic differential equation (SDE) sampling and reward evaluation at multiple timesteps to capture trajectory advantages dynamically, identify preference inconsistencies, and prevent erroneous trajectory selection. Extensive experiments on Text-to-Image and Text-to-Video generation tasks demonstrate significant improvements in noisy latent evaluation and alignment performance.
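
The timestep-inconsistency problem named in the abstract can be illustrated with a small sketch. This is not the paper's TAPO algorithm: the function names, the sign-agreement rule used to decide consistency, and the toy reward values are all assumptions made for illustration. It only shows how scoring two trajectories at several timesteps can expose a preference flip that a single-timestep comparison would miss.

```python
from typing import List, Optional, Sequence


def per_timestep_preference(rewards_a: Sequence[float],
                            rewards_b: Sequence[float]) -> List[int]:
    """Return +1 where trajectory A scores higher, -1 where B does, 0 on ties."""
    if len(rewards_a) != len(rewards_b):
        raise ValueError("both trajectories must be scored at the same timesteps")
    return [(a > b) - (a < b) for a, b in zip(rewards_a, rewards_b)]


def is_consistent(prefs: Sequence[int]) -> bool:
    """A pair is consistent if no two timesteps disagree on the winner."""
    signs = {p for p in prefs if p != 0}
    return len(signs) <= 1


def select_pair(rewards_a: Sequence[float],
                rewards_b: Sequence[float]) -> Optional[str]:
    """Return 'A' or 'B' if the ranking agrees at every timestep, else None.

    Hypothetical rule: pairs whose ranking flips (or is tied everywhere)
    are discarded rather than used as a preference signal.
    """
    prefs = per_timestep_preference(rewards_a, rewards_b)
    if not any(prefs) or not is_consistent(prefs):
        return None
    return "A" if sum(prefs) > 0 else "B"


# Toy rewards at three hypothetical timesteps, ordered from noisy to clean.
consistent = select_pair([0.2, 0.5, 0.9], [0.1, 0.3, 0.4])    # A wins everywhere
inconsistent = select_pair([0.2, 0.5, 0.9], [0.6, 0.3, 0.4])  # ranking flips
```

In the second pair, a comparison at the first timestep alone would prefer B, while the later timesteps prefer A; the sketch returns `None` for such pairs instead of committing to either ranking.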

Cite

Text

Xian et al. "Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models." International Conference on Learning Representations, 2026.

Markdown

[Xian et al. "Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xian2026iclr-consistent/)

BibTeX

@inproceedings{xian2026iclr-consistent,
  title     = {{Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models}},
  author    = {Xian, Xiaole and He, Xilin and Chen, Wenting and Liu, Wenshuang and Mu, Wenqi and He, Yancheng and Li, Liang and Zhang, Yi and Yue, Xiangyu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xian2026iclr-consistent/}
}