Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information

Abstract

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty—w.r.t. a model $\mathcal{V}$—as the lack of $\mathcal{V}$-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce pointwise $\mathcal{V}$-information (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-usable information and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.
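
The abstract does not spell out how PVI is computed, so the following is only a minimal sketch under stated assumptions. It assumes the paper's definition, PVI(x → y) = −log2 g′[∅](y) + log2 g[x](y), where g is a model from $\mathcal{V}$ fit on the real inputs and g′ is one fit on null (empty) inputs, and it assumes $\mathcal{V}$-usable information can be estimated as the mean PVI over held-out instances. The function names and the probabilities below are hypothetical and purely illustrative; in practice the probabilities would come from two trained models.

```python
import math
from typing import Sequence


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-information of one instance, in bits.

    p_with_input: probability the input-aware model g assigns to the gold label.
    p_null: probability the null-input model g' assigns to the same gold label.
    (Assumed definition: -log2 g'[null](y) + log2 g[x](y).)
    """
    return math.log2(p_with_input) - math.log2(p_null)


def v_information(p_with_input: Sequence[float], p_null: Sequence[float]) -> float:
    """Estimate V-usable information as the mean PVI over held-out instances."""
    if len(p_with_input) != len(p_null) or not p_with_input:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [pvi(p, q) for p, q in zip(p_with_input, p_null)]
    return sum(scores) / len(scores)


# Hypothetical gold-label probabilities for three held-out instances.
gold_probs_with_input = [0.9, 0.5, 0.1]
gold_probs_null = [0.5, 0.5, 0.5]

instance_pvi = [pvi(p, q) for p, q in zip(gold_probs_with_input, gold_probs_null)]
dataset_estimate = v_information(gold_probs_with_input, gold_probs_null)
```

Under these assumptions, a higher PVI marks an instance that is easier for $\mathcal{V}$, a PVI of zero means the input added nothing beyond the label prior, and a negative PVI means the input-aware model did worse than the null-input one. A lower dataset-level estimate corresponds to a more difficult dataset for $\mathcal{V}$, as the abstract states.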

Cite

Text

Ethayarajh et al. "Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information." International Conference on Machine Learning, 2022.

Markdown

[Ethayarajh et al. "Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/ethayarajh2022icml-understanding/)

BibTeX

@inproceedings{ethayarajh2022icml-understanding,
  title     = {{Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information}},
  author    = {Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {5988--6008},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/ethayarajh2022icml-understanding/}
}