Evaluating Language Models as Risk Scores
Abstract
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes answer distributions regardless of the true underlying data uncertainty. This reveals a general inability of instruction-tuned models to express data uncertainty using multiple-choice answers. A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models. These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind spot in the current evaluation ecosystem that folktexts covers.
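The core mechanism the abstract describes, reading a risk score off a model's next-token distribution over multiple-choice answers, can be illustrated with a short sketch. The following is a minimal, illustrative example using Hugging Face transformers; it is not folktexts' actual API, and the model choice, prompt wording, and answer-token handling are simplifying assumptions (the package handles prompt construction and answer mapping systematically).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works. Illustrative only, not the paper's setup.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A hypothetical census-style multiple-choice prompt (assumed wording).
prompt = (
    "The person is 45 years old and works 50 hours per week in management.\n"
    "Question: Is this person's yearly income above $50,000?\n"
    "A. Yes\n"
    "B. No\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position

# Renormalize the probability mass over the two answer tokens,
# yielding a risk score P(income > $50k) in [0, 1].
yes_id = tokenizer(" A", add_special_tokens=False).input_ids[-1]
no_id = tokenizer(" B", add_special_tokens=False).input_ids[-1]
probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
risk_score = probs[0].item()
print(f"P(income > $50k) = {risk_score:.3f}")

Because the score is a probability rather than a single predicted label, it can be evaluated for calibration against the empirical outcome frequency in census data, which is the unrealizable-task evaluation the benchmark targets.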
Cite
Text
Cruz et al. "Evaluating Language Models as Risk Scores." Neural Information Processing Systems, 2024. doi:10.52202/079017-3089

Markdown
[Cruz et al. "Evaluating Language Models as Risk Scores." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/cruz2024neurips-evaluating/) doi:10.52202/079017-3089

BibTeX
@inproceedings{cruz2024neurips-evaluating,
  title     = {{Evaluating Language Models as Risk Scores}},
  author    = {Cruz, André F. and Hardt, Moritz and Mendler-Dünner, Celestine},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3089},
  url       = {https://mlanthology.org/neurips/2024/cruz2024neurips-evaluating/}
}