Feature Selection for Unbalanced Class Distribution and Naive Bayes
Abstract
This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics. It is developed for the Naive Bayesian classifier applied to text data, since it combines well with the addressed learning problems. We focus on domains with many features that also have a highly unbalanced class distribution and asymmetric misclassification costs given only implicitly in the problem. By asymmetric misclassification costs we mean that one of the class values is the target class value for which we want to get predictions, and that we prefer false positives over false negatives. Our example problem is automatic document categorization using machine learning, where we want to identify documents relevant to the selected category. Usually, only about 1%-10% of examples belong to the selected category. Our experimental comparison of eleven feature scoring measures shows that considering domain and algorithm characteristics significantly impro...
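As an illustration of feature subset selection by scoring, the following minimal sketch ranks words by a smoothed log odds ratio with respect to the target class and keeps the top k. The choice of odds ratio is an assumption made for illustration only; the abstract as shown here does not name the eleven measures. The function names, the smoothing constant, and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, smoothing=0.5):
    """Score each word by a smoothed log odds ratio for the target class.

    docs: list of token lists; labels: list of booleans (True = target class).
    The additive smoothing is an illustrative choice that avoids division by zero.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for words, y in zip(docs, labels):
        # Count document frequency: each word at most once per document.
        (pos_df if y else neg_df).update(set(words))
    scores = {}
    for w in set(pos_df) | set(neg_df):
        p_pos = (pos_df[w] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[w] + smoothing) / (n_neg + 2 * smoothing)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data: one target-class document out of four (unbalanced classes).
docs = [
    ["naive", "bayes", "text"],
    ["football", "match", "text"],
    ["weather", "report"],
    ["football", "report"],
]
labels = [True, False, False, False]

scores = odds_ratio_scores(docs, labels)
selected = select_top_k(scores, 2)
```

Words that occur mainly in target-class documents receive positive scores, while words typical of the remaining documents receive negative scores, so a top-k cut favors features that indicate the rare target class.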
Cite
Text
Mladenic and Grobelnik. "Feature Selection for Unbalanced Class Distribution and Naive Bayes." International Conference on Machine Learning, 1999.
Markdown
[Mladenic and Grobelnik. "Feature Selection for Unbalanced Class Distribution and Naive Bayes." International Conference on Machine Learning, 1999.](https://mlanthology.org/icml/1999/mladenic1999icml-feature/)
BibTeX
@inproceedings{mladenic1999icml-feature,
title = {{Feature Selection for Unbalanced Class Distribution and Naive Bayes}},
author = {Mladenic, Dunja and Grobelnik, Marko},
booktitle = {International Conference on Machine Learning},
year = {1999},
pages = {258-267},
url = {https://mlanthology.org/icml/1999/mladenic1999icml-feature/}
}