Coefficient Tree Regression: Fast, Accurate and Interpretable Predictive Modeling

Abstract

The proliferation of data collection technologies often results in large data sets with many observations and many variables. In practice, highly relevant engineered features are often groups of predictors that share a common regression coefficient (i.e., the predictors in the group affect the response only via their collective sum), where the groups are unknown in advance and must be discovered from the data. We propose an algorithm called coefficient tree regression (CTR) to discover the group structure and fit the resulting regression model. In this regard CTR is an automated way of engineering new features, each of which is the collective sum of the predictors within each group. The algorithm can be used when the number of variables is larger than, or smaller than, the number of observations. Creating new features that affect the response in a similar manner improves predictive modeling, especially in domains where the relationships between predictors are not known a priori. CTR borrows computational strategies from both linear regression (fast model updating when adding/modifying a feature in the model) and regression trees (fast partitioning to form and split groups) to achieve outstanding computational and predictive performance. Finding features that represent hidden groups of predictors (i.e., a hidden ontology) that impact the response only via their sum also has major interpretability advantages, which we demonstrate with a real data example of predicting political affiliations with television viewing habits. In numerical comparisons over a variety of examples, we demonstrate that both computational expense and predictive performance are far superior to existing methods that create features as groups of predictors. 
Moreover, CTR's overall predictive performance is comparable to, or slightly better than, that of the regular lasso, which we include as a reference benchmark even though it is non-group-based; CTR also has substantial computational and interpretive advantages over lasso.
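To make the core modeling idea concrete, the sketch below illustrates the kind of structure CTR is designed to discover: predictors that influence the response only through their collective sum, with one shared coefficient per group. This is a minimal, hypothetical illustration of the group-sum feature representation, not the CTR algorithm itself; here the groups are assumed known, whereas CTR discovers them from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 9 predictors, but the response depends only on the
# sums of three hidden groups, each with a single shared coefficient.
n, p = 200, 9
X = rng.normal(size=(n, p))
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # unknown in practice; CTR finds them
true_coefs = [2.0, -1.0, 0.5]                # one coefficient per group

y = sum(c * X[:, g].sum(axis=1) for c, g in zip(true_coefs, groups))
y += rng.normal(scale=0.1, size=n)

# Engineer one feature per group: the collective sum of its predictors.
Z = np.column_stack([X[:, g].sum(axis=1) for g in groups])

# Ordinary least squares on the 3 engineered features instead of 9 predictors.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)
```

With the true group structure recovered, the 9-predictor problem reduces to a 3-feature regression whose estimated coefficients are close to the shared group coefficients, which is the source of the interpretability gains the abstract describes.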

Cite

Text

Sürer et al. "Coefficient Tree Regression: Fast, Accurate and Interpretable Predictive Modeling." Machine Learning, 2024. doi:10.1007/s10994-021-06091-7

Markdown

[Sürer et al. "Coefficient Tree Regression: Fast, Accurate and Interpretable Predictive Modeling." Machine Learning, 2024.](https://mlanthology.org/mlj/2024/surer2024mlj-coefficient/) doi:10.1007/s10994-021-06091-7

BibTeX

@article{surer2024mlj-coefficient,
  title     = {{Coefficient Tree Regression: Fast, Accurate and Interpretable Predictive Modeling}},
  author    = {Sürer, Özge and Apley, Daniel W. and Malthouse, Edward C.},
  journal   = {Machine Learning},
  year      = {2024},
  pages     = {4723--4759},
  doi       = {10.1007/s10994-021-06091-7},
  volume    = {113},
  url       = {https://mlanthology.org/mlj/2024/surer2024mlj-coefficient/}
}