Enhancing LLMs via High-Knowledge Data Selection

Abstract

The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel, gradient-free High-Knowledge Scorer (HKS) that selects high-quality data along the dimension of knowledge, alleviating the problem of knowledge scarcity in the pre-training corpus. We build a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of a text. Based on these metrics, we propose a comprehensive knowledge scorer to select knowledge-intensive data; the scorer can also be used for domain-specific high-knowledge data selection by restricting the knowledge elements to a specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves model performance on knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.
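
To make the density and coverage idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the formulas, so the definitions below are assumptions chosen for illustration. Density is taken as the fraction of tokens that match a knowledge element, coverage as the fraction of distinct pool elements that appear in the text, and the two are combined with a hypothetical weight `alpha`. The function names, the single-token matching, and the toy pool are all hypothetical.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def knowledge_stats(text: str, pool: set[str]) -> tuple[float, float]:
    """Return (density, coverage) for a text against a knowledge element pool.

    Assumed definitions (not taken from the paper):
      density  = matched tokens / total tokens
      coverage = distinct matched elements / pool size
    Only single-token elements are matched in this sketch.
    """
    tokens = tokenize(text)
    if not tokens or not pool:
        return 0.0, 0.0
    hits = [t for t in tokens if t in pool]
    density = len(hits) / len(tokens)
    coverage = len(set(hits)) / len(pool)
    return density, coverage


def knowledge_score(text: str, pool: set[str], alpha: float = 0.5) -> float:
    """Combine density and coverage with a weighted sum (an assumed combination)."""
    density, coverage = knowledge_stats(text, pool)
    return alpha * density + (1.0 - alpha) * coverage


def select_top(docs: list[str], pool: set[str], k: int) -> list[str]:
    """Keep the k documents with the highest knowledge score."""
    return sorted(docs, key=lambda d: knowledge_score(d, pool), reverse=True)[:k]


# Toy pool of hypothetical knowledge elements.
POOL = {"enzyme", "protein", "mitosis", "ribosome"}
DOCS = ["The enzyme binds the protein", "I went to the store today"]
SELECTED = select_top(DOCS, POOL, k=1)
```

Under these assumptions, domain-specific selection amounts to passing a pool restricted to one domain's elements, which mirrors the restriction described in the abstract. No gradients or model forward passes are needed, only lookups against the pool.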

Cite

Text

Duan et al. "Enhancing LLMs via High-Knowledge Data Selection." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I22.34555

Markdown

[Duan et al. "Enhancing LLMs via High-Knowledge Data Selection." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/duan2025aaai-enhancing/) doi:10.1609/AAAI.V39I22.34555

BibTeX

@inproceedings{duan2025aaai-enhancing,
  title     = {{Enhancing LLMs via High-Knowledge Data Selection}},
  author    = {Duan, Feiyu and Zhang, Xuemiao and Wang, Sirui and Que, Haoran and Liu, Yuqi and Rong, Wenge and Cai, Xunliang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {23832--23840},
  doi       = {10.1609/AAAI.V39I22.34555},
  url       = {https://mlanthology.org/aaai/2025/duan2025aaai-enhancing/}
}