NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data
Abstract
Group Relative Policy Optimization (GRPO) fine-tuning has demonstrated significant gains on reasoning tasks. However, it typically relies on high-quality labeled datasets, which are often difficult to obtain. To address this challenge, we introduce \textbf{N}oise-\textbf{A}ware \textbf{D}ual-\textbf{R}eward \textbf{O}ptimization (\textbf{NaDRO}), which effectively enhances the training of Large Language Models (LLMs) under noisy or ambiguous supervision. NaDRO operates through two key components: \textbf{(1) Preference-based Outcome Reward (POR)}, which makes a principled bias-variance tradeoff, reducing training variance by learning from robust preference rankings instead of overfitting to single-best estimates; and \textbf{(2) Context Perception Reward (CPR)}, which ensures that LLMs conduct the necessary qualitative assessment of the current problem state, fostering deeper situational understanding prior to decision-making. To validate our approach in a realistic decision-making testbed, we model classic combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) as Markov Decision Processes, generating training data via cost-limited exploration. Our results demonstrate that the fine-tuned Qwen 7B and Llama 3.1-8B models achieve statistically robust performance, significantly outperforming leading LLM baselines and standard fine-tuning methods on these complex benchmarks. Code is released at \url{https://github.com/microsoft/HeurAgenix/tree/NaDRO}.
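To make the POR idea concrete, the following is a minimal sketch (ours, not the paper's implementation) of a rank-based group reward for GRPO-style training. The function name, the rank-to-reward mapping, and the example values are all illustrative assumptions: the point is only that rewards are derived from each rollout's relative rank within its group rather than from the raw, high-variance return estimates.

```python
import numpy as np

def preference_based_outcome_reward(noisy_returns: np.ndarray) -> np.ndarray:
    """Map noisy per-rollout returns in one GRPO group to rank-based rewards.

    Hypothetical sketch of the POR idea: instead of normalizing the raw
    (noisy, high-variance) return estimates, rank the rollouts and use the
    centered ranks as rewards, trading a small bias for variance reduction.
    """
    assert len(noisy_returns) >= 2, "a group needs at least two rollouts to rank"
    # Double argsort yields each rollout's rank (0 = worst outcome in the group).
    ranks = noisy_returns.argsort().argsort()
    # Scale ranks to [0, 1], then center within the group so rewards sum to zero.
    scaled = ranks / (len(noisy_returns) - 1)
    return scaled - scaled.mean()

# Example: noisy negative tour costs for a group of 4 TSP rollouts.
group_returns = np.array([-105.2, -98.7, -120.4, -99.9])
print(preference_based_outcome_reward(group_returns))
```

Because only the ordering of outcomes matters here, per-rollout noise that preserves the ranking leaves the reward signal unchanged, which is the variance-reduction intuition behind learning from preferences rather than single-best estimates.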
Cite
Text
Qian et al. "NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data." Advances in Neural Information Processing Systems, 2025.
Markdown
[Qian et al. "NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/qian2025neurips-nadro/)
BibTeX
@inproceedings{qian2025neurips-nadro,
title = {{NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data}},
author = {Qian, Haolong and Yang, Xianliang and Zhang, Ling and Song, Lei and Bian, Jiang and Yuan, Chun},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/qian2025neurips-nadro/}
}