Supervised and Unsupervised Discretization of Continuous Features

Abstract

Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised discretization method, with entropy-based and purity-based methods, which are supervised algorithms. We find that the performance of the Naive-Bayes algorithm significantly improves when features are discretized using an entropy-based method; in fact, over the 16 tested datasets, the discretized version of Naive-Bayes slightly outperforms C4.5 on average. We also show that in some cases the performance of the C4.5 induction algorithm significantly improves if features are discretized in advance; in our experiments, performance never significantly degraded, an interesting result given that C4.5 is capable of discretizing features locally.
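The two method families compared in the abstract can be sketched minimally. Below is an illustrative Python sketch, not code from the paper: equal-width binning (unsupervised) ignores class labels, while an entropy-based split (supervised) picks the cut point that minimizes the weighted class entropy of the resulting partitions, as in one level of a recursive entropy discretizer. Function names and the default bin count are assumptions for illustration.

```python
from collections import Counter
from math import log2

def equal_width_bins(values, k=10):
    """Unsupervised discretization: split the observed range into k
    equal-width intervals and map each value to its bin index.
    (k=10 is an assumed default, not a value from the paper.)"""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1  # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]

def entropy(labels):
    """Class entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """Supervised discretization (one step): return the cut point that
    minimizes the size-weighted entropy of the two partitions it induces.
    A full entropy-based discretizer would apply this recursively."""
    pairs = sorted(zip(values, labels))
    best, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # only cut between distinct feature values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best:
            best, best_cut = e, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut
```

For example, with values `[1, 2, 3, 4]` and labels `[0, 0, 1, 1]`, the entropy-based split lands at 2.5, exactly separating the classes, whereas equal-width binning would place its boundaries without consulting the labels at all.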

Cite

Text

Dougherty et al. "Supervised and Unsupervised Discretization of Continuous Features." International Conference on Machine Learning, 1995. doi:10.1016/B978-1-55860-377-6.50032-3

Markdown

[Dougherty et al. "Supervised and Unsupervised Discretization of Continuous Features." International Conference on Machine Learning, 1995.](https://mlanthology.org/icml/1995/dougherty1995icml-supervised/) doi:10.1016/B978-1-55860-377-6.50032-3

BibTeX

@inproceedings{dougherty1995icml-supervised,
  title     = {{Supervised and Unsupervised Discretization of Continuous Features}},
  author    = {Dougherty, James and Kohavi, Ron and Sahami, Mehran},
  booktitle = {International Conference on Machine Learning},
  year      = {1995},
  pages     = {194--202},
  doi       = {10.1016/B978-1-55860-377-6.50032-3},
  url       = {https://mlanthology.org/icml/1995/dougherty1995icml-supervised/}
}