EvalAssist: LLM-as-a-Judge Simplified

Abstract

We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. A library of LLM-based evaluators is made available that incorporates various algorithmic innovations, such as token-probability based judgement, positional bias checking, and certainty estimation, that help to engender trust in the evaluation process. We have conducted extensive benchmarks and also deployed the system internally in our organization with several hundred users.
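The paper does not spell out its evaluator internals here, but as a rough sketch of what token-probability based judgement commonly means: instead of parsing the judge model's text output, the log-probabilities of the candidate verdict tokens are renormalized to yield a soft score. The function name and the logprob values below are hypothetical, not from EvalAssist.

```python
import math

def verdict_probability(token_logprobs: dict, positive: str = "Yes") -> float:
    """Renormalize log-probabilities over the candidate verdict tokens
    and return the probability mass assigned to the positive verdict."""
    total = sum(math.exp(lp) for lp in token_logprobs.values())
    return math.exp(token_logprobs[positive]) / total

# Hypothetical logprobs for the judge model's first output token.
logprobs = {"Yes": -0.2, "No": -1.8}
p_yes = verdict_probability(logprobs)
print(round(p_yes, 2))
```

A score like this can also feed a certainty estimate: a verdict with probability near 0.5 is far less trustworthy than one near 1.0.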

Cite

Text

Desmond et al. "EvalAssist: LLM-as-a-Judge Simplified." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35351

Markdown

[Desmond et al. "EvalAssist: LLM-as-a-Judge Simplified." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/desmond2025aaai-evalassist/) doi:10.1609/AAAI.V39I28.35351

BibTeX

@inproceedings{desmond2025aaai-evalassist,
  title     = {{EvalAssist: LLM-as-a-Judge Simplified}},
  author    = {Desmond, Michael and Ashktorab, Zahra and Geyer, Werner and Daly, Elizabeth M. and Cooper, Martín Santillán and Pan, Qian and Nair, Rahul and Wagner, Nico and Pedapati, Tejaswini},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {29637--29639},
  doi       = {10.1609/AAAI.V39I28.35351},
  url       = {https://mlanthology.org/aaai/2025/desmond2025aaai-evalassist/}
}