Prediction by Categorical Features: Generalization Properties and Application to Feature Ranking

Abstract

We describe and analyze a new approach for feature ranking in the presence of categorical features with a large number of possible values. It is shown that popular ranking criteria, such as the Gini index and the misclassification error, can be interpreted as the training error of a predictor that is deduced from the training set. It is then argued that using the generalization error is a more adequate ranking criterion. We propose a modification of the Gini index criterion, based on a robust estimation of the generalization error of a predictor associated with the Gini index. The properties of this new estimator are analyzed, showing that for most training sets, it produces an accurate estimation of the true generalization error. We then address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties.
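The claim that the Gini index and the misclassification error are training errors of a predictor can be checked numerically. The sketch below is not taken from the paper. It assumes binary labels in {0, 1} and the normalization in which the Gini index of a feature is the sum over feature values v of p_v * 2 * q_v * (1 - q_v), where p_v is the empirical frequency of value v and q_v is the empirical fraction of positive labels among examples with that value. Other texts drop the factor 2. All function names are hypothetical. Under these assumptions, the Gini index equals the expected training error of a randomized predictor that outputs label 1 with probability q_v, and the misclassification error equals the training error of the per-value majority-vote predictor.

```python
from collections import Counter


def per_value_stats(xs, ys):
    """Map each feature value v to (p_v, q_v) as defined in the lead-in."""
    n = len(xs)
    counts = Counter(xs)
    positives = Counter(x for x, y in zip(xs, ys) if y == 1)
    return {v: (counts[v] / n, positives[v] / counts[v]) for v in counts}


def gini_index(xs, ys):
    # Assumed convention: sum_v p_v * 2 * q_v * (1 - q_v).
    return sum(p * 2 * q * (1 - q) for p, q in per_value_stats(xs, ys).values())


def misclassification_error(xs, ys):
    # sum_v p_v * min(q_v, 1 - q_v).
    return sum(p * min(q, 1 - q) for p, q in per_value_stats(xs, ys).values())


def randomized_predictor_training_error(xs, ys):
    """Expected training error when predicting 1 with probability q_v."""
    stats = per_value_stats(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        q = stats[x][1]
        # The prediction is wrong with probability 1 - q if y == 1, else q.
        total += (1 - q) if y == 1 else q
    return total / len(xs)


def majority_predictor_training_error(xs, ys):
    """Training error when predicting the majority label of each value."""
    stats = per_value_stats(xs, ys)
    mistakes = 0
    for x, y in zip(xs, ys):
        prediction = 1 if stats[x][1] >= 0.5 else 0
        if prediction != y:
            mistakes += 1
    return mistakes / len(xs)


# Toy data, made up for illustration only.
xs = ["a", "a", "a", "b", "b", "c"]
ys = [1, 1, 0, 0, 0, 1]

print(gini_index(xs, ys), randomized_predictor_training_error(xs, ys))
print(misclassification_error(xs, ys), majority_predictor_training_error(xs, ys))
```

On this toy data each pair of printed numbers agrees. The same functions also illustrate why training error can mislead when a feature has many possible values: a feature that takes a distinct value on every example gets a score of 0 under both criteria, whatever its actual predictive value.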

Cite

Text

Sabato and Shalev-Shwartz. "Prediction by Categorical Features: Generalization Properties and Application to Feature Ranking." Annual Conference on Computational Learning Theory, 2007. doi:10.1007/978-3-540-72927-3_40

Markdown

[Sabato and Shalev-Shwartz. "Prediction by Categorical Features: Generalization Properties and Application to Feature Ranking." Annual Conference on Computational Learning Theory, 2007.](https://mlanthology.org/colt/2007/sabato2007colt-prediction/) doi:10.1007/978-3-540-72927-3_40

BibTeX

@inproceedings{sabato2007colt-prediction,
  title     = {{Prediction by Categorical Features: Generalization Properties and Application to Feature Ranking}},
  author    = {Sabato, Sivan and Shalev-Shwartz, Shai},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2007},
  pages     = {559--573},
  doi       = {10.1007/978-3-540-72927-3_40},
  url       = {https://mlanthology.org/colt/2007/sabato2007colt-prediction/}
}