FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets

Abstract

Evaluation of Large Language Models (LLMs) is challenging because instruction-following requires alignment with human values, and the set of skills needed varies from instruction to instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e., overall preference-based evaluation), which limits interpretability because it ignores the instance-wise skill composition that user instructions demand. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation that decomposes coarse-level scoring into skill set-level scoring for each instruction. We experimentally observe that the granularity of evaluation is crucial for obtaining a holistic view of model performance and for increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations.
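The protocol's core move, annotating each instruction with the skills it requires, scoring each skill separately, and aggregating per skill rather than per response, can be illustrated with a minimal sketch. This is not the authors' implementation: the skill names below are a subset of the alignment skills defined in the paper, the 1-5 rating scale follows the Likert scale FLASK uses, and the function names and data structures are hypothetical.

```python
# Minimal sketch of skill set-level scoring in the style of FLASK.
# `rate_skill` stands in for the evaluator (a human annotator or an
# LLM judge prompted with a per-skill rubric) returning a 1-5 rating.
from statistics import mean

# A subset of the alignment skills defined in the paper (hypothetical identifiers).
SKILLS = ["logical_correctness", "factuality", "comprehension", "completeness"]

def evaluate_instance(instruction, response, relevant_skills, rate_skill):
    """Score one (instruction, response) pair on each skill the instruction requires."""
    return {skill: rate_skill(instruction, response, skill)
            for skill in relevant_skills}

def aggregate_by_skill(instance_scores):
    """Average ratings per skill across all instances that used that skill,
    yielding a skill-level profile instead of a single coarse score."""
    by_skill = {}
    for scores in instance_scores:
        for skill, rating in scores.items():
            by_skill.setdefault(skill, []).append(rating)
    return {skill: mean(ratings) for skill, ratings in by_skill.items()}

# Example with a dummy evaluator that rates every skill as 4.
profile = aggregate_by_skill([
    evaluate_instance("Explain TCP slow start.", "...",
                      ["comprehension", "factuality"],
                      lambda instr, resp, skill: 4),
])
print(profile)  # {'comprehension': 4, 'factuality': 4}
```

The per-skill profile is what makes the evaluation interpretable: two models with the same overall preference score can differ sharply on, say, factuality versus completeness, and that difference only becomes visible once scoring is decomposed this way.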

Cite

Text

Ye et al. "FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets." NeurIPS 2023 Workshops: Instruction, 2023.

Markdown

[Ye et al. "FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets." NeurIPS 2023 Workshops: Instruction, 2023.](https://mlanthology.org/neuripsw/2023/ye2023neuripsw-flask/)

BibTeX

@inproceedings{ye2023neuripsw-flask,
  title     = {{FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets}},
  author    = {Ye, Seonghyeon and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Kim, Seungone and Jo, Yongrae and Thorne, James and Kim, Juho and Seo, Minjoon},
  booktitle = {NeurIPS 2023 Workshops: Instruction},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/ye2023neuripsw-flask/}
}