Feature Encodings for Gradient Boosting with Automunge
Abstract
Automunge is a tabular preprocessing library that encodes dataframes for supervised learning. When selecting a default feature encoding strategy for gradient boosted learning, one may consider metrics of training duration and achieved predictive performance associated with the feature representations. Automunge offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations on a series of diverse data sets with tuned gradient boosted learning. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one hot encoding did not perform in a manner consistent with suitability to serve as a categoric default in comparison to categoric binarization. We present here these and further benchmarks.
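The two defaults named in the abstract, binarization for categoric features and z-score normalization for numeric features, can be sketched with a minimal pandas/numpy example. This is an illustrative sketch only, not the Automunge API; the function names here are hypothetical. Binarization maps each distinct category to an integer and spreads that integer across ceil(log2(n)) binary columns, so a feature with four categories needs only two columns where one hot encoding would need four.

```python
import numpy as np
import pandas as pd

def binarize_categoric(series):
    """Binary-encode a categoric column: map each distinct value to an
    integer code, then represent that code across ceil(log2(n)) binary
    columns. Illustrative sketch; not the Automunge API."""
    codes, _ = pd.factorize(series)
    n_bits = max(1, int(np.ceil(np.log2(max(codes.max() + 1, 2)))))
    cols = {f"{series.name}_b{i}": (codes >> i) & 1 for i in range(n_bits)}
    return pd.DataFrame(cols, index=series.index)

def zscore(series):
    """Z-score normalization: subtract the mean, divide by the
    standard deviation, yielding a zero-mean unit-variance column."""
    return (series - series.mean()) / series.std()

df = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "yellow"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
# Four distinct categories -> 2 binary columns (one hot would use 4).
encoded = pd.concat(
    [binarize_categoric(df["color"]), zscore(df["value"])], axis=1
)
```

The column-count difference is the intuition behind the tuning-duration result: fewer categoric columns means a smaller feature space for the gradient boosted learner to split on.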
Cite
Text
Teague. "Feature Encodings for Gradient Boosting with Automunge." NeurIPS 2022 Workshops: HITY, 2022.
Markdown
[Teague. "Feature Encodings for Gradient Boosting with Automunge." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/teague2022neuripsw-feature/)
BibTeX
@inproceedings{teague2022neuripsw-feature,
title = {{Feature Encodings for Gradient Boosting with Automunge}},
author = {Teague, Nicholas},
booktitle = {NeurIPS 2022 Workshops: HITY},
year = {2022},
url = {https://mlanthology.org/neuripsw/2022/teague2022neuripsw-feature/}
}