Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge

Abstract

Large Language Models (LLMs) have achieved remarkable progress in natural language comprehension, reasoning, and generation, sparking interest in their creative potential. Automating creativity evaluation in LLMs, particularly in physical reasoning tasks, presents a transformative opportunity to accelerate scientific discovery by enabling innovative solutions, uncovering patterns, and automating problem-solving processes. Current creativity evaluation frameworks, however, rely heavily on human annotation, making them subjective, resource-intensive, and impractical to scale. To address this, we introduce a novel automated evaluation framework rooted in the cognitive science principles of divergent and convergent thinking. Divergent creativity is measured using Semantic Entropy, a sampling-based metric that quantifies variability in generated outputs to capture the novelty of ideas. Convergent creativity is assessed using a modified retrieval-based discussion framework, 60% more efficient than the original, in which autonomous multi-agent systems evaluate task solutions for feasibility, safety, and effectiveness. We implement these methodologies within a benchmark based on the MacGyver dataset, which contains 300 real-world, solvable problems requiring innovative use of everyday objects. Our framework evaluates state-of-the-art LLMs, such as GPT and LLaMA models, while analyzing the effects of key parameters such as temperature, model size, and recency. By automating creativity evaluation, we establish a scalable, objective, and reproducible methodology to enhance LLM development, paving the way for breakthroughs in scientific discovery and creative problem-solving across diverse fields.
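
The abstract does not spell out how Semantic Entropy is computed, so the following is a minimal sketch of the general recipe from the uncertainty-estimation literature, not the paper's implementation: sample several candidate solutions for the same problem, group solutions that express the same idea, and take the Shannon entropy of the resulting cluster distribution. The are_equivalent helper and the toy inputs are illustrative assumptions; in practice semantic equivalence is typically judged with an NLI model or embedding similarity.

import math

def semantic_entropy(samples, are_equivalent):
    """Shannon entropy over clusters of semantically equivalent samples.

    samples: generated solutions for a single problem (list of str).
    are_equivalent(a, b): True if two solutions express the same idea.
    Higher entropy means a more diverse (novel) set of ideas.
    """
    clusters = []  # each cluster holds mutually equivalent solutions
    for s in samples:
        for cluster in clusters:
            if are_equivalent(s, cluster[0]):  # compare against the cluster's representative
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no match found: start a new cluster
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Toy usage: exact string match stands in for a real semantic-equivalence check.
samples = [
    "use the belt as a makeshift rope",
    "use the belt as a makeshift rope",
    "melt the candle wax to seal the crack",
]
print(semantic_entropy(samples, lambda a, b: a == b))  # ~0.64 nats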

Cite

Text

Sen et al. "Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Sen et al. "Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/sen2025iclrw-automating/)

BibTeX

@inproceedings{sen2025iclrw-automating,
  title     = {{Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge}},
  author    = {Sen, Tan Min and Chun, Zachary Choy Kit and Saikia, Swaagat Bikash and Alsagoff, Syed Ali Redha and Mohor, Banerjee and Wangsajaya, Nadya Yuki and Chan, Alvin},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/sen2025iclrw-automating/}
}