Decision Theoretic Subsampling for Induction on Large Databases

Abstract

Standard methods of constructing decision trees can be prohibitively expensive when induction algorithms are given very large training sets on which to compare attributes. This expense can often be avoided: by using a subsample for this calculation, we can approximate the information gain used to assess each attribute. Selecting an attribute on this basis involves a certain risk, which depends on the subsample size and the information gain values observed. This paper addresses the questions of assessing when the choice may be made with a given expected error, and of determining a sampling strategy that minimizes the computational cost of making it. The theory is entirely analogous to that used for decision-theoretic control of search. An empirical evaluation shows that an implementation performs well over an appropriately wide range of inputs on two fundamental tasks: deciding whether a choice of attribute can safely be made on a given sample, and if not, estimating how large a subsample is likely to be required to do so.
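The core idea in the abstract — approximating an attribute's information gain from a random subsample rather than the full training set — can be sketched as follows. This is a minimal illustration under our own assumptions (categorical attributes, Shannon entropy, uniform random subsampling); the function names and setup are illustrative, not the paper's implementation:

```python
import math
import random

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, labels):
    """Information gain of splitting `rows` on categorical attribute index `attr`."""
    n = len(rows)
    # Partition the labels by the value each row takes on `attr`.
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(y)
    # Expected entropy after the split, weighted by partition size.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def estimated_gain(rows, labels, attr, sample_size, rng=random):
    """Estimate information gain from a random subsample instead of the full set.

    With very large `rows`, computing the gain on a subsample of size
    `sample_size` is much cheaper; the estimate's reliability (and hence the
    risk of picking the wrong attribute) depends on the subsample size.
    """
    idx = rng.sample(range(len(rows)), min(sample_size, len(rows)))
    sub_rows = [rows[i] for i in idx]
    sub_labels = [labels[i] for i in idx]
    return info_gain(sub_rows, attr, sub_labels)
```

The paper's contribution is the decision-theoretic layer on top of such an estimate: judging from the observed gains and the subsample size whether the attribute choice can already be made with a given expected error, and if not, how much more data to sample.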

Cite

Text

Musick et al. "Decision Theoretic Subsampling for Induction on Large Databases." International Conference on Machine Learning, 1993. doi:10.1016/B978-1-55860-307-3.50034-4

Markdown

[Musick et al. "Decision Theoretic Subsampling for Induction on Large Databases." International Conference on Machine Learning, 1993.](https://mlanthology.org/icml/1993/musick1993icml-decision/) doi:10.1016/B978-1-55860-307-3.50034-4

BibTeX

@inproceedings{musick1993icml-decision,
  title     = {{Decision Theoretic Subsampling for Induction on Large Databases}},
  author    = {Musick, Ron and Catlett, Jason and Russell, Stuart},
  booktitle = {International Conference on Machine Learning},
  year      = {1993},
  pages     = {212--219},
  doi       = {10.1016/B978-1-55860-307-3.50034-4},
  url       = {https://mlanthology.org/icml/1993/musick1993icml-decision/}
}