LLM-Based Feature Generation from Text for Interpretable Machine Learning

Abstract

Traditional text representations such as embeddings and bag-of-words hinder rule learning and other interpretable machine learning methods because of their high dimensionality and poor comprehensibility. This article investigates using Large Language Models (LLMs) to extract a small number of interpretable text features. We propose two workflows: one fully automated by the LLM (feature proposal and value calculation), and another in which users define the features and the LLM calculates their values. This LLM-based feature extraction enables interpretable rule learning and avoids issues such as the spurious interpretability seen with bag-of-words. We evaluate the proposed methods on five diverse datasets (including scientometrics, banking, hate speech, and food hazard detection). LLM-generated features yielded predictive performance similar to the SciBERT embedding model while using far fewer, interpretable features. Human users judged most of the generated features to be relevant for the corresponding prediction tasks. We illustrate practical utility with a case study on mining recommendation action rules to improve research article quality and citation impact.
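To make the second workflow described in the abstract concrete, the sketch below shows how user-defined, human-readable features could be filled in by an LLM for each document. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature names, prompt wording, model choice, and the use of the OpenAI chat completions API are illustrative placeholders.

```python
# Minimal sketch (not the authors' implementation) of the user-defined-features
# workflow: the user names a few interpretable features and an LLM assigns their
# values per document. Model name, prompt, and feature definitions are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical human-readable features for an article-quality prediction task.
FEATURES = {
    "methodology_rigor": "integer 1-5 rating of how rigorous the described methodology is",
    "novelty_claimed": "true/false, does the text explicitly claim a novel contribution",
    "application_domain": "one word naming the application domain",
}

def extract_features(text: str) -> dict:
    """Ask the LLM to return one value per predefined feature as a JSON object."""
    prompt = (
        "Read the text below and return a JSON object with exactly these keys:\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in FEATURES.items())
        + f"\n\nText:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    # The resulting low-dimensional, interpretable feature vectors can then be fed
    # to a rule learner instead of embeddings or bag-of-words representations.
    sample = "We propose a new method for citation prediction, evaluated on five datasets."
    print(extract_features(sample))
```

The fully automated workflow would differ only in that the feature set itself is first proposed by the LLM (e.g., by prompting it with a sample of documents and the prediction target) before the per-document values are computed as above.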

Cite

Text

Balek et al. "LLM-Based Feature Generation from Text for Interpretable Machine Learning." Machine Learning, 2025. doi:10.1007/s10994-025-06867-1

Markdown

[Balek et al. "LLM-Based Feature Generation from Text for Interpretable Machine Learning." Machine Learning, 2025.](https://mlanthology.org/mlj/2025/balek2025mlj-llmbased/) doi:10.1007/s10994-025-06867-1

BibTeX

@article{balek2025mlj-llmbased,
  title     = {{LLM-Based Feature Generation from Text for Interpretable Machine Learning}},
  author    = {Balek, Vojtěch and Sýkora, Lukáš and Sklenák, Vilém and Kliegr, Tomáš},
  journal   = {Machine Learning},
  year      = {2025},
  pages     = {241},
  doi       = {10.1007/s10994-025-06867-1},
  volume    = {114},
  url       = {https://mlanthology.org/mlj/2025/balek2025mlj-llmbased/}
}