Improving Data Efficiency via Curating LLM-Driven Rating Systems
Abstract
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce $DS^2$, a **D**iversity-aware **S**core curation method for **D**ata **S**election. By systematically modeling error patterns through a score transition matrix, $DS^2$ corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less".
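To make the score-transition-matrix idea concrete, here is a minimal sketch in the spirit of classical label-noise correction. All names and values are illustrative assumptions, not the paper's actual $DS^2$ implementation: we assume a matrix `T` where `T[i, j]` is the probability that the LLM rater assigns score `j` to a sample whose true quality score is `i`, and invert it to recover the underlying true-score distribution from the observed one.

```python
import numpy as np

# Hypothetical score transition matrix for scores in {0, 1, 2}:
# T[i, j] = P(LLM assigns score j | true quality score is i).
# Values here are made up for illustration.
T = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])

# Observed (empirical) distribution of LLM-assigned scores over a dataset.
p_observed = np.array([0.30, 0.40, 0.30])

# The noise model implies p_observed = T^T @ p_true, so we solve the
# linear system to estimate the true-score distribution.
p_true = np.linalg.solve(T.T, p_observed)

# Inversion can produce small negative entries under model mismatch;
# clip and renormalize to keep a valid probability vector.
p_true = np.clip(p_true, 0.0, None)
p_true /= p_true.sum()

print(p_true)
```

Since each row of `T` sums to 1, the recovered vector also sums to 1; the corrected distribution can then reweight or re-rank samples before the diversity-aware selection step.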
Cite
Pang et al. "Improving Data Efficiency via Curating LLM-Driven Rating Systems." International Conference on Learning Representations, 2025.
@inproceedings{pang2025iclr-improving,
  title     = {{Improving Data Efficiency via Curating LLM-Driven Rating Systems}},
  author    = {Pang, Jinlong and Wei, Jiaheng and Shah, Ankit and Zhu, Zhaowei and Wang, Yaxuan and Qian, Chen and Liu, Yang and Bao, Yujia and Wei, Wei},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/pang2025iclr-improving/}
}