Performance Evaluation in Machine Learning: The Good, the Bad, the Ugly, and the Way Forward

Abstract

This paper gives an overview of some ways in which our understanding of performance evaluation measures for machine-learned classifiers has improved over the last twenty years. I also highlight a range of areas where this understanding is still lacking, leading to ill-advised practices in classifier evaluation. This suggests that in order to make further progress we need to develop a proper measurement theory of machine learning. I then demonstrate by example what such a measurement theory might look like and what kinds of new results it would entail. Finally, I argue that key properties such as classification ability and data set difficulty are unlikely to be directly observable, suggesting the need for latent-variable models and causal inference.
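As a point of reference (not part of the original abstract), a minimal sketch of the kinds of classifier evaluation measures the paper scrutinises, computed with scikit-learn on toy predictions; the labels and scores below are purely illustrative.

```python
# Minimal sketch (illustrative only): common evaluation measures for
# machine-learned classifiers, computed with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels (toy data)
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]                   # thresholded predictions
y_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # classifier scores

print("accuracy :", accuracy_score(y_true, y_pred))   # fraction of correct predictions
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("AUC      :", roc_auc_score(y_true, y_scores))  # ranking ability of the scores
```

Each of these measures summarises a different aspect of classifier behaviour, which is precisely why their interpretation and comparability deserve the careful, measurement-theoretic treatment the paper argues for.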

Cite

Text

Flach. "Performance Evaluation in Machine Learning: The Good, the Bad, the Ugly, and the Way Forward." AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/AAAI.V33I01.33019808

Markdown

[Flach. "Performance Evaluation in Machine Learning: The Good, the Bad, the Ugly, and the Way Forward." AAAI Conference on Artificial Intelligence, 2019.](https://mlanthology.org/aaai/2019/flach2019aaai-performance/) doi:10.1609/AAAI.V33I01.33019808

BibTeX

@inproceedings{flach2019aaai-performance,
  title     = {{Performance Evaluation in Machine Learning: The Good, the Bad, the Ugly, and the Way Forward}},
  author    = {Flach, Peter A.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2019},
  pages     = {9808--9814},
  doi       = {10.1609/AAAI.V33I01.33019808},
  url       = {https://mlanthology.org/aaai/2019/flach2019aaai-performance/}
}