DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization

She, Shuaijie; Bao, Yu; Lu, Yu; Xu, Lu; Li, Tao; Zhu, Wenhao; Zhang, Jianbing; Huang, Shujian; Cheng, Shanbo; Lu, Lu; Wang, Yuxuan

ICLR 2026

Abstract

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.1 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on four challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
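
The reconstruction-as-reward idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`dual_reward`, `rerank`, `toy_dual_solver`) and the exact-match reward are hypothetical, and a deterministic function stands in for the LLM that would instantiate the dual task. The primal task here is "compute a + b", with `a` treated as the known component and `b` as the unknown one.

```python
from typing import Callable, Sequence


def dual_reward(
    known: int,
    unknown: int,
    primal_output: int,
    dual_solver: Callable[[int, int], int],
) -> float:
    """Score a primal output by how well the dual task recovers the hidden input.

    Hypothetical exact-match reward: 1.0 if the dual task reconstructs the
    unknown component from the primal output and the known component, else 0.0.
    """
    reconstructed = dual_solver(known, primal_output)
    return 1.0 if reconstructed == unknown else 0.0


def rerank(
    known: int,
    unknown: int,
    candidates: Sequence[int],
    dual_solver: Callable[[int, int], int],
) -> int:
    """Pick the candidate primal output with the highest dual reward."""
    return max(
        candidates,
        key=lambda y: dual_reward(known, unknown, y, dual_solver),
    )


def toy_dual_solver(known_a: int, primal_answer: int) -> int:
    """Stand-in for the dual task: recover b from the claimed sum and known a."""
    return primal_answer - known_a


# Three candidate answers to "5 + 7 = ?"; no ground-truth label for the sum is used.
candidates = [11, 12, 13]
best = rerank(known=5, unknown=7, candidates=candidates, dual_solver=toy_dual_solver)
print(best)
```

Only the candidate from which the hidden variable can be recovered receives a nonzero reward, so the same score could in principle serve either as a training signal or, as here, as an inference-time reranking criterion.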

Cite

Text

She et al. "DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization." International Conference on Learning Representations, 2026.

Markdown

[She et al. "DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/she2026iclr-dupo/)

BibTeX

@inproceedings{she2026iclr-dupo,
  title     = {{DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization}},
  author    = {She, Shuaijie and Bao, Yu and Lu, Yu and Xu, Lu and Li, Tao and Zhu, Wenhao and Zhang, Jianbing and Huang, Shujian and Cheng, Shanbo and Lu, Lu and Wang, Yuxuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/she2026iclr-dupo/}
}