DIME: An Information-Theoretic Difficulty Measure for AI Datasets
Abstract
Evaluating the relative difficulty of widely used benchmark datasets across time and across data modalities is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on Fano’s inequality and a neural-network estimate of the conditional entropy of the sample-label distribution. DIME can be decomposed into components attributable to the data distribution and the number of samples. DIME can also compute per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME is well aligned with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
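The abstract's core idea can be illustrated with a minimal sketch. Fano's inequality states that for a K-class problem, H(Y|X) ≤ H_b(P_e) + P_e·log(K−1), where H_b is the binary entropy; inverting it turns an estimate of the conditional entropy H(Y|X) (e.g., the cross-entropy loss of a trained classifier, which upper-bounds it) into a lower bound on the achievable error rate, a natural difficulty score. The function names and the bisection approach below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def fano_rhs(p: float, k: int) -> float:
    """Right-hand side of Fano's inequality: H_b(p) + p*log(k-1), in nats.

    Monotonically increasing in p on [0, (k-1)/k], which lets us invert it
    by bisection.
    """
    hb = 0.0
    if 0.0 < p < 1.0:
        hb = -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
    return hb + (p * math.log(k - 1) if k > 2 else 0.0)

def fano_error_lower_bound(h_cond: float, num_classes: int) -> float:
    """Smallest error probability P_e consistent with Fano's inequality,
    given an estimate of the conditional entropy H(Y|X) in nats.

    A larger bound indicates a harder dataset. This is an illustrative
    sketch, not the paper's estimator.
    """
    k = num_classes
    p_max = (k - 1) / k  # error of random guessing; RHS reaches log(k) here
    if h_cond <= 0.0:
        return 0.0
    if h_cond >= math.log(k):
        return p_max
    lo, hi = 0.0, p_max
    for _ in range(60):  # bisection to ~1e-18 precision
        mid = 0.5 * (lo + hi)
        if fano_rhs(mid, k) < h_cond:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, a 10-class dataset on which the best model's cross-entropy is 1.0 nat yields a Fano error lower bound of roughly 0.22, whereas a near-zero conditional entropy yields a bound near zero (an "easy" dataset under this measure).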
Cite
Text
Zhang et al. "DIME: An Information-Theoretic Difficulty Measure for AI Datasets." NeurIPS 2020 Workshops: DL-IG, 2020.
Markdown
[Zhang et al. "DIME: An Information-Theoretic Difficulty Measure for AI Datasets." NeurIPS 2020 Workshops: DL-IG, 2020.](https://mlanthology.org/neuripsw/2020/zhang2020neuripsw-dime/)
BibTeX
@inproceedings{zhang2020neuripsw-dime,
title = {{DIME: An Information-Theoretic Difficulty Measure for AI Datasets}},
author = {Zhang, Peiliang and Wang, Huan and Naik, Nikhil and Xiong, Caiming and Socher, Richard},
booktitle = {NeurIPS 2020 Workshops: DL-IG},
year = {2020},
url = {https://mlanthology.org/neuripsw/2020/zhang2020neuripsw-dime/}
}