Model Performance Scaling with Multiple Data Sources

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as for designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands on recent observations of scaling laws for log-linear generalization error in the i.i.d. setting and uses them to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves highly accurate ($r^2 \approx 0.9$) predictions of model performance under substantial extrapolation on two standard supervised learning tasks, and remains accurate ($r^2 \approx 0.83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.
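To give a sense of the workflow the abstract describes, here is a minimal sketch of fitting a scaling curve to a handful of pilot training runs and extrapolating it. This is an illustration only, not the paper's method: the paper fits a rational function of the per-source data composition using an experimental-design objective, whereas the sketch below assumes a single data source, a standard power-law form $L(n) = c + a n^{-b}$, and made-up measurements; the helper name `scaling_law` and all numbers are hypothetical.

```python
# Sketch: fit a single-source power-law scaling curve to a few
# (dataset size, validation loss) pairs, then extrapolate.
# Assumptions: power-law form L(n) = c + a * n^{-b}; synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    """Irreducible loss c plus a power-law term a * n^{-b}."""
    return c + a * np.power(n, -b)

# Hypothetical measurements from a few small pilot training runs.
sizes = np.array([1_000, 2_000, 5_000, 10_000, 20_000], dtype=float)
losses = np.array([1.10, 0.95, 0.80, 0.72, 0.66])

# Fit the three parameters from the pilot runs.
params, _ = curve_fit(scaling_law, sizes, losses, p0=[1.0, 0.5, 0.5], maxfev=10_000)
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")

# Predict the loss at a much larger (unseen) dataset size.
print("predicted loss at n=100k:", scaling_law(100_000.0, *params))
```

In the multi-source setting studied in the paper, the scalar size `n` would be replaced by the sizes or proportions of each data source, and the curve above by the paper's rational function approximation.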

Cite

Text

Hashimoto. "Model Performance Scaling with Multiple Data Sources." International Conference on Machine Learning, 2021.

Markdown

[Hashimoto. "Model Performance Scaling with Multiple Data Sources." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/hashimoto2021icml-model/)

BibTeX

@inproceedings{hashimoto2021icml-model,
  title     = {{Model Performance Scaling with Multiple Data Sources}},
  author    = {Hashimoto, Tatsunori},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {4107--4116},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/hashimoto2021icml-model/}
}