Evaluating Human-Language Model Interaction

Abstract

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
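To make the distinction concrete, the sketch below shows one way an interaction trace could be recorded so that process-level and first-person signals are available alongside the final output. This is an illustrative assumption only: the class and field names (Turn, SurveyResponse, InteractionTrace, helpfulness, enjoyment, ownership) are hypothetical and are not the instrumentation or schema used in the paper.

# Hypothetical sketch of an interaction trace for process-level evaluation.
# Names are illustrative assumptions, not the paper's actual framework code.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One step of the human-LM exchange (the interactive process)."""
    actor: str          # "user" or "model"
    text: str
    timestamp: float    # seconds since task start

@dataclass
class SurveyResponse:
    """First-person, post-task ratings (1-5), beyond output quality alone."""
    helpfulness: int
    enjoyment: int
    ownership: int

@dataclass
class InteractionTrace:
    task: str                                   # e.g., "summarization"
    turns: List[Turn] = field(default_factory=list)
    final_output: str = ""
    survey: Optional[SurveyResponse] = None

# A non-interactive benchmark scores only `final_output`; interactive metrics
# can additionally use `turns` (process) and `survey` (subjective experience).
trace = InteractionTrace(task="metaphor generation")
trace.turns.append(Turn(actor="user", text="Metaphor for 'time'?", timestamp=0.0))
trace.turns.append(Turn(actor="model", text="Time is a river.", timestamp=1.2))
trace.final_output = "Time is a river carving its own banks."
trace.survey = SurveyResponse(helpfulness=4, enjoyment=5, ownership=3)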

Cite

Text

Lee et al. "Evaluating Human-Language Model Interaction." Transactions on Machine Learning Research, 2023.

Markdown

[Lee et al. "Evaluating Human-Language Model Interaction." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/lee2023tmlr-evaluating/)

BibTeX

@article{lee2023tmlr-evaluating,
  title     = {{Evaluating Human-Language Model Interaction}},
  author    = {Lee, Mina and Srivastava, Megha and Hardy, Amelia and Thickstun, John and Durmus, Esin and Paranjape, Ashwin and Gerard-Ursin, Ines and Li, Xiang Lisa and Ladhak, Faisal and Rong, Frieda and Wang, Rose E. and Kwon, Minae and Park, Joon Sung and Cao, Hancheng and Lee, Tony and Bommasani, Rishi and Bernstein, Michael S. and Liang, Percy},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/lee2023tmlr-evaluating/}
}