Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Abstract

Synthetic data augmentation via Large Language Models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data are scarce. However, the generated data can deviate from the real-world distribution, and this misalignment can degrade results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a tiny amount of real-world data. We empirically assess the effectiveness of our methods on multiple text classification tasks; the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data-weighting approaches, offering a potential solution for effectively leveraging synthetic data from any suitable data generator.
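The core idea of a weighted loss can be illustrated with a minimal sketch: each synthetic example contributes to the training objective in proportion to a per-example weight, so low-quality generations are down-weighted rather than discarded. The sketch below is a generic weighted cross-entropy in plain Python, not the paper's specific weighting scheme; the toy logits, labels, and weight values are hypothetical, and in the paper's setting the weights would be derived from a small real-world dataset.

```python
import math

def weighted_cross_entropy(logit_rows, labels, weights):
    """Cross-entropy averaged with normalized per-example weights.

    logit_rows: list of per-class logit lists, one row per example.
    labels:     list of true class indices.
    weights:    list of non-negative per-example weights
                (hypothetical quality scores in this sketch).
    """
    total_w = sum(weights)
    loss = 0.0
    for logits, y, w in zip(logit_rows, labels, weights):
        # Negative log-softmax of the true class: log(sum_j exp(l_j)) - l_y
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += (w / total_w) * (log_z - logits[y])
    return loss

# Toy batch: 4 synthetic examples, 3 classes, made-up weights.
logit_rows = [[2.0, 0.5, 0.1],
              [0.2, 1.5, 0.3],
              [0.1, 0.2, 2.2],
              [1.0, 1.0, 1.0]]
labels = [0, 1, 2, 0]
weights = [1.0, 2.0, 0.5, 1.5]
loss = weighted_cross_entropy(logit_rows, labels, weights)
```

With uniform weights this reduces to the standard mean cross-entropy, which is why such schemes can be seen as a strict generalization of the usual objective.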

Cite

Text

Kuo et al. "Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification." International Conference on Learning Representations, 2025.

Markdown

[Kuo et al. "Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/kuo2025iclr-all/)

BibTeX

@inproceedings{kuo2025iclr-all,
  title     = {{Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification}},
  author    = {Kuo, Hsun-Yu and Liao, Yin-Hsiang and Chao, Yu-Chieh and Ma, Wei-Yun and Cheng, Pu-Jen},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/kuo2025iclr-all/}
}