Bagging Is an Optimal PAC Learner
Abstract
Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. Finally, the seminal work of Hanneke (2016) gave an algorithm with a provably optimal sample complexity. His algorithm is based on a careful and structured sub-sampling of the training data, returning a majority vote among hypotheses trained on each of the sub-samples. While a very exciting theoretical result, it has not had much impact in practice, in part due to its inefficiency: it constructs a polynomial number of sub-samples of the training data, each of linear size. In this work, we prove the surprising result that the practical and classic heuristic bagging (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging pre-dates Hanneke’s algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it requires only a logarithmic number of sub-samples to reach optimality.
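As a point of reference, here is a minimal sketch of classic bagging as the abstract describes it: draw sub-samples with replacement from the training data, train a base learner on each, and return the majority vote. The choice of base learner (a scikit-learn decision tree), the t = ⌈log₂ n⌉ number of sub-samples, and binary {0, 1} labels are illustrative assumptions, not the paper's exact construction.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, seed=0):
    """Bootstrap aggregation: majority vote over learners trained on sub-samples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    t = max(1, math.ceil(math.log2(n)))  # logarithmically many sub-samples
    votes = np.zeros((t, len(X_test)), dtype=int)
    for i in range(t):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample n indices with replacement
        h = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes[i] = h.predict(X_test)
    # Majority vote over the t hypotheses (labels assumed to be in {0, 1}).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```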
Cite
Text
Larsen. "Bagging Is an Optimal PAC Learner." Conference on Learning Theory, 2023.Markdown
[Larsen. "Bagging Is an Optimal PAC Learner." Conference on Learning Theory, 2023.](https://mlanthology.org/colt/2023/larsen2023colt-bagging/)BibTeX
@inproceedings{larsen2023colt-bagging,
  title = {{Bagging Is an Optimal PAC Learner}},
  author = {Larsen, Kasper Green},
  booktitle = {Conference on Learning Theory},
  year = {2023},
  pages = {450--468},
  volume = {195},
  url = {https://mlanthology.org/colt/2023/larsen2023colt-bagging/}
}