Datamodels: Understanding Predictions with Data and Data with Predictions

Abstract

We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed “target” example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$, predicts the outcome of training a model on $S'$ and evaluating it on $x$, using only information about which examples of $S$ are contained in $S'$. Despite the complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
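The idea of a linear datamodel can be sketched in a few lines. The snippet below is a hypothetical illustration (the variable names and the synthetic setup are assumptions, not the authors' code): each "training run" is represented by a 0/1 mask over the training set $S$ indicating which examples landed in the subset $S'$, and a linear model is fit by least squares to predict the run's output on a fixed target $x$ from the mask alone. Here the expensive inner training loop is replaced by a synthetic linear-plus-noise outcome, purely so the regression step is runnable end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_runs = 50, 500  # |S| and number of (subset, output) samples

# Synthetic ground-truth "influence" of each training example on the
# output for the target x (demonstration only; in practice the output
# would come from actually training a model on each subset).
true_weights = rng.normal(scale=0.1, size=n_train)
bias = 0.5

# masks[i, j] = 1 iff example j of S is contained in the i-th subset S'.
masks = (rng.random((n_runs, n_train)) < 0.5).astype(float)

# Simulated model output on x for each run: linear in the mask + noise.
outputs = masks @ true_weights + bias + rng.normal(scale=0.01, size=n_runs)

# Fit the linear datamodel by least squares: predict output from the mask.
X = np.hstack([masks, np.ones((n_runs, 1))])  # append an intercept column
theta, *_ = np.linalg.lstsq(X, outputs, rcond=None)
w_hat, b_hat = theta[:-1], theta[-1]

# The recovered per-example weights should track the true influences.
print("weight correlation:", np.corrcoef(w_hat, true_weights)[0, 1])
```

In the paper's actual setting, `outputs` would be a quantity such as the correct-class margin of a network trained from scratch on each subset, and the learned weights `w_hat` are what power the applications listed above (counterfactual prediction, brittleness, similarity, leakage detection).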

Cite

Text

Ilyas et al. "Datamodels: Understanding Predictions with Data and Data with Predictions." International Conference on Machine Learning, 2022.

Markdown

[Ilyas et al. "Datamodels: Understanding Predictions with Data and Data with Predictions." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/ilyas2022icml-datamodels/)

BibTeX

@inproceedings{ilyas2022icml-datamodels,
  title     = {{Datamodels: Understanding Predictions with Data and Data with Predictions}},
  author    = {Ilyas, Andrew and Park, Sung Min and Engstrom, Logan and Leclerc, Guillaume and Madry, Aleksander},
  booktitle = {International Conference on Machine Learning},
  year      = {2022},
  pages     = {9525--9587},
  volume    = {162},
  url       = {https://mlanthology.org/icml/2022/ilyas2022icml-datamodels/}
}