Dong, Hanze
26 publications
NeurIPS
2025
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
NeurIPS
2024
Online Iterative Reinforcement Learning from Human Feedback with General Preference Model