Feature Encodings for Gradient Boosting with Automunge
Abstract
Automunge is a tabular preprocessing library that encodes dataframes for supervised learning. When selecting a default feature encoding strategy for gradient boosted learning, one may consider metrics of training duration and achieved predictive performance associated with the feature representations. Automunge offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations on a series of diverse data sets with tuned gradient boosted learning. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one hot encoding did not perform in a manner consistent with suitability to serve as a categoric default in comparison to categoric binarization. We present here these and further benchmarks.
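The two defaults named in the abstract, binarization for categoric features and z-score normalization for numeric features, can be sketched with a minimal pandas/numpy example. This is an illustrative sketch only, not the Automunge API; the function names here are hypothetical. Binarization maps each distinct category to an integer and spreads that integer across ceil(log2(n)) binary columns, so a feature with four categories needs only two columns where one hot encoding would need four.

```python
import numpy as np
import pandas as pd

def binarize_categoric(series):
    """Binary-encode a categoric column: map each distinct value to an
    integer code, then represent that code across ceil(log2(n)) binary
    columns. Illustrative sketch; not the Automunge API."""
    codes, _ = pd.factorize(series)
    n_bits = max(1, int(np.ceil(np.log2(max(codes.max() + 1, 2)))))
    cols = {f"{series.name}_b{i}": (codes >> i) & 1 for i in range(n_bits)}
    return pd.DataFrame(cols, index=series.index)

def zscore(series):
    """Z-score normalization: subtract the mean, divide by the
    standard deviation, yielding a zero-mean unit-variance column."""
    return (series - series.mean()) / series.std()

df = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "yellow"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
# Four distinct categories -> 2 binary columns (one hot would use 4).
encoded = pd.concat(
    [binarize_categoric(df["color"]), zscore(df["value"])], axis=1
)
```

The column-count difference is the intuition behind the tuning-duration result: fewer categoric columns means a smaller feature space for the gradient boosted learner to split on.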
Cite
Text
Teague. "Feature Encodings for Gradient Boosting with Automunge." NeurIPS 2022 Workshops: HITY, 2022.
Markdown
[Teague. "Feature Encodings for Gradient Boosting with Automunge." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/teague2022neuripsw-feature/)
BibTeX
@inproceedings{teague2022neuripsw-feature,
title = {{Feature Encodings for Gradient Boosting with Automunge}},
author = {Teague, Nicholas},
booktitle = {NeurIPS 2022 Workshops: HITY},
year = {2022},
url = {https://mlanthology.org/neuripsw/2022/teague2022neuripsw-feature/}
}