Datamodels: Understanding Predictions with Data and Data with Predictions
Abstract
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed “target” example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$, predicts the outcome of training a model on $S'$ and evaluating on $x$, using only information about which examples of $S$ are contained in $S'$. Despite the complexity of the underlying process being approximated (e.g. end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
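To make the definition above concrete, the following is a minimal sketch of a linear datamodel, not the authors' implementation. It encodes each subset $S'$ as a 0/1 indicator vector over $S$ and fits one weight per training example plus a bias. Two things here are assumptions made for illustration: the fit uses plain ordinary least squares (the abstract does not specify an estimator), and the "model outputs" are synthetic stand-ins rather than results of actually training models. All function and variable names are hypothetical.

```python
import numpy as np


def sample_subset_masks(n_train, n_subsets, alpha, rng):
    """Return an (n_subsets, n_train) 0/1 matrix.

    Row i marks which training examples belong to the i-th subset S'.
    Each example is included independently with probability alpha (an assumption).
    """
    return (rng.random((n_subsets, n_train)) < alpha).astype(float)


def fit_linear_datamodel(masks, outputs):
    """Fit outputs ~ masks @ theta + bias by ordinary least squares."""
    design = np.hstack([masks, np.ones((masks.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]


def predict(theta, bias, mask):
    """Predict the output on the target x for the subset encoded by mask."""
    return mask @ theta + bias


rng = np.random.default_rng(0)
n_train, n_subsets = 20, 500

# Synthetic ground truth standing in for "train on S' and evaluate on x".
true_theta = rng.normal(size=n_train)
true_bias = 0.3
masks = sample_subset_masks(n_train, n_subsets, alpha=0.5, rng=rng)
outputs = masks @ true_theta + true_bias + 0.01 * rng.normal(size=n_subsets)

theta, bias = fit_linear_datamodel(masks, outputs)

# Predict the outcome for a new, unseen subset S'.
new_mask = sample_subset_masks(n_train, 1, alpha=0.5, rng=rng)[0]
prediction = predict(theta, bias, new_mask)
```

In this sketch each entry of `theta` can be read as the estimated effect of including one training example on the output at $x$. In a real setting, `outputs` would come from repeatedly training models on sampled subsets, which is far more expensive than the synthetic generator used here.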
Cite
Text
Ilyas et al. "Datamodels: Understanding Predictions with Data and Data with Predictions." International Conference on Machine Learning, 2022.
Markdown
[Ilyas et al. "Datamodels: Understanding Predictions with Data and Data with Predictions." International Conference on Machine Learning, 2022.](https://mlanthology.org/icml/2022/ilyas2022icml-datamodels/)
BibTeX
@inproceedings{ilyas2022icml-datamodels,
title = {{Datamodels: Understanding Predictions with Data and Data with Predictions}},
author = {Ilyas, Andrew and Park, Sung Min and Engstrom, Logan and Leclerc, Guillaume and Madry, Aleksander},
booktitle = {International Conference on Machine Learning},
year = {2022},
pages = {9525-9587},
volume = {162},
url = {https://mlanthology.org/icml/2022/ilyas2022icml-datamodels/}
}