Dong, Hanze
25 publications
NeurIPS
2025
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
NeurIPS
2024
Online Iterative Reinforcement Learning from Human Feedback with General Preference Model