Balanced and Token-Efficient Summarization of User Reviews via Stratified Sampling and Large Language Models

Abstract

User-generated reviews offer valuable insights into consumer experiences, preferences, and concerns. They give producers direct feedback on how a product is perceived and how it could be improved, and they help users evaluate strengths, weaknesses, and alternatives. Advanced machine learning techniques, including Large Language Models (LLMs) such as BERT and GPT, enhance the extraction of meaningful information from these vast datasets. This paper introduces a framework that leverages LLMs to generate high-quality summaries using minimal input tokens. By combining multidimensional classification (sentiment, topics, emotion) with a stratified sampling approach, our framework selects a compact yet comprehensive subset of reviews that accurately represents the original dataset. Tailored prompts guide the LLMs to create balanced summaries that fairly represent both strengths and weaknesses. Experiments on Amazon and Tripadvisor datasets demonstrate that our method significantly reduces token usage and computational costs, while consistently outperforming traditional AI-based summarization approaches in terms of content coverage, balance, and semantic accuracy.
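
The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each review has already been labeled with a sentiment, a topic, and an emotion, and it uses plain proportional allocation with at least one review per stratum, which is one common way to do stratified sampling. The names `stratified_sample`, `build_prompt`, and `budget`, as well as the prompt wording, are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(reviews, budget, seed=0):
    """Pick roughly `budget` reviews, allocated proportionally per stratum.

    A stratum is one (sentiment, topic, emotion) combination. This is an
    illustrative assumption, not the allocation rule used in the paper.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for review in reviews:
        key = (review["sentiment"], review["topic"], review["emotion"])
        strata[key].append(review)

    total = len(reviews)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share of the budget, but never zero, so that rare
        # strata stay represented. The result may slightly exceed `budget`.
        k = max(1, round(budget * len(members) / total))
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def build_prompt(sample):
    """Assemble a hypothetical prompt asking for a balanced summary."""
    lines = [f"- [{r['sentiment']}/{r['topic']}] {r['text']}" for r in sample]
    return (
        "Summarize the reviews below, covering both strengths and weaknesses.\n"
        + "\n".join(lines)
    )


# Toy data: labels would normally come from upstream classifiers.
reviews = (
    [{"text": f"Battery lasts long ({i})", "sentiment": "positive",
      "topic": "battery", "emotion": "joy"} for i in range(6)]
    + [{"text": f"Battery died fast ({i})", "sentiment": "negative",
        "topic": "battery", "emotion": "anger"} for i in range(3)]
    + [{"text": "Great price", "sentiment": "positive",
        "topic": "price", "emotion": "joy"}]
)

sample = stratified_sample(reviews, budget=5)
prompt = build_prompt(sample)
```

Forcing at least one review per stratum trades a strict token budget for coverage of minority opinions; a stricter variant could cap the total and drop the smallest strata instead.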

Cite

Text

Marozzo et al. "Balanced and Token-Efficient Summarization of User Reviews via Stratified Sampling and Large Language Models." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025. doi:10.1007/978-3-032-06078-5_17

Markdown

[Marozzo et al. "Balanced and Token-Efficient Summarization of User Reviews via Stratified Sampling and Large Language Models." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025.](https://mlanthology.org/ecmlpkdd/2025/marozzo2025ecmlpkdd-balanced/) doi:10.1007/978-3-032-06078-5_17

BibTeX

@inproceedings{marozzo2025ecmlpkdd-balanced,
  title     = {{Balanced and Token-Efficient Summarization of User Reviews via Stratified Sampling and Large Language Models}},
  author    = {Marozzo, Fabrizio and Belcastro, Loris and Cosentino, Cristian and Liò, Pietro},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2025},
  pages     = {290--306},
  doi       = {10.1007/978-3-032-06078-5_17},
  url       = {https://mlanthology.org/ecmlpkdd/2025/marozzo2025ecmlpkdd-balanced/}
}