Trustworthy Model Evaluation on a Budget

Abstract

Standard practice in Machine Learning (ML) research uses ablation studies to evaluate a novel method. We find that errors in the ablation setup can lead to incorrect conclusions about which method components contribute to performance. Previous work has shown that the majority of experiments published in top conferences are performed with few experimental trials (fewer than 50) and manually sampled hyperparameters. Using the insights from our meta-analysis, we demonstrate how current practices can lead to unreliable conclusions. We simulate an ablation study on an existing Neural Architecture Search (NAS) benchmark and perform a 120-trial ablation study using ResNet50. We quantify the selection bias of Hyperparameter Optimization (HPO) strategies to show that only random sampling can produce reliable results when estimating the top and mean performance of a method under a limited computational budget.
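
The selection-bias effect described in the abstract can be illustrated with a minimal sketch (not the paper's code): on a synthetic benchmark of configurations with noisy evaluations, a toy greedy HPO sampler that re-samples near the best configuration seen so far inflates the estimated mean performance, while uniform random sampling stays close to the ground truth. All names, the synthetic benchmark, and the greedy sampler below are hypothetical assumptions for illustration only.

# Minimal sketch: selection bias of a greedy HPO-style sampler vs. random sampling
# when estimating top and mean performance from a limited budget of trials.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical benchmark: each configuration has a latent "true" score,
# and each trial returns a noisy observation of it.
N_CONFIGS = 1000
true_scores = rng.normal(loc=0.70, scale=0.05, size=N_CONFIGS)  # e.g., accuracy
NOISE = 0.02
BUDGET = 50  # trials, mirroring the "fewer than 50 trials" regime

def evaluate(idx):
    """Run one noisy trial of configuration `idx`."""
    return true_scores[idx] + rng.normal(0.0, NOISE)

def random_search(budget):
    """Uniform random sampling of configurations (unbiased mean estimate)."""
    idx = rng.choice(N_CONFIGS, size=budget, replace=False)
    return np.array([evaluate(i) for i in idx])

def greedy_hpo(budget, warmup=10):
    """Toy HPO: after a warm-up, keep sampling near the best configuration
    observed so far; the resulting sample over-represents strong configs."""
    tried = list(rng.choice(N_CONFIGS, size=warmup, replace=False))
    observed = [evaluate(i) for i in tried]
    for _ in range(budget - warmup):
        best = tried[int(np.argmax(observed))]
        # Propose a "nearby" configuration (index-adjacent, purely illustrative).
        cand = int(np.clip(best + rng.integers(-5, 6), 0, N_CONFIGS - 1))
        tried.append(cand)
        observed.append(evaluate(cand))
    return np.array(observed)

def summarize(name, scores):
    print(f"{name:>12}: top={scores.max():.3f}  mean={scores.mean():.3f}")

print(f"ground truth: top={true_scores.max():.3f}  mean={true_scores.mean():.3f}")
summarize("random", random_search(BUDGET))
summarize("greedy HPO", greedy_hpo(BUDGET))

Running the sketch shows the greedy sampler's mean drifting above the benchmark's true mean while random sampling tracks it, which is the kind of bias the paper quantifies for real HPO strategies.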

Cite

Text

Fostiropoulos et al. "Trustworthy Model Evaluation on a Budget." ICLR 2023 Workshops: RTML, 2023.

Markdown

[Fostiropoulos et al. "Trustworthy Model Evaluation on a Budget." ICLR 2023 Workshops: RTML, 2023.](https://mlanthology.org/iclrw/2023/fostiropoulos2023iclrw-trustworthy/)

BibTeX

@inproceedings{fostiropoulos2023iclrw-trustworthy,
  title     = {{Trustworthy Model Evaluation on a Budget}},
  author    = {Fostiropoulos, Iordanis and Brown, Bowman Noah and Itti, Laurent},
  booktitle = {ICLR 2023 Workshops: RTML},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/fostiropoulos2023iclrw-trustworthy/}
}