Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large $p$

ICML 2025 pp. 72041-72062

Abstract

Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which are common in modern scientific applications. This "large $p$" setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 19 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.
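The pre-screening idea in the abstract (select variables nonparametrically, then run SR on the reduced input space) can be illustrated with a minimal sketch. This is not the authors' PAN procedure, which the abstract does not specify; it swaps in a simple marginal screen based on the correlation ratio (the share of the variance of $y$ explained by a binned conditional mean). The function names, the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random


def correlation_ratio(x, y, n_bins=8):
    """Share of Var(y) explained by the binned conditional mean E[y | x]."""
    n = len(y)
    y_mean = sum(y) / n
    total = sum((v - y_mean) ** 2 for v in y)
    if total == 0.0:
        return 0.0
    order = sorted(range(n), key=lambda i: x[i])  # equal-frequency bins
    between = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if idx:
            bin_mean = sum(y[i] for i in idx) / len(idx)
            between += len(idx) * (bin_mean - y_mean) ** 2
    return between / total


def prescreen(X, y, keep):
    """Return indices of the `keep` highest-scoring columns, sorted ascending."""
    p = len(X[0])
    scores = [correlation_ratio([row[j] for row in X], y) for j in range(p)]
    top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:keep]
    return sorted(top)


# Synthetic data (hypothetical): 50 inputs, only columns 3 and 17 matter.
rng = random.Random(0)
n, p = 2000, 50
X = [[rng.uniform(-2.0, 2.0) for _ in range(p)] for _ in range(n)]
y = [math.sin(2.0 * row[3]) + row[17] ** 2 + rng.gauss(0.0, 0.1) for row in X]

selected = prescreen(X, y, keep=2)
# Reduced design matrix that a downstream SR method would search over.
X_reduced = [[row[j] for j in selected] for row in X]
```

Because the score makes no assumption about functional form, it picks up both the sinusoidal and the quadratic dependence, which a linear correlation screen would largely miss on this symmetric input range. A marginal screen like this one can still overlook variables that act only through interactions, so it should be read as a toy stand-in for the selection step rather than a substitute for it.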

Cite

Text

Ye and Li. "Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large $p$." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Ye and Li. "Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large $p$." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/ye2025icml-ab/)

BibTeX

@inproceedings{ye2025icml-ab,
  title     = {{Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large $p$}},
  author    = {Ye, Shengbin and Li, Meng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {72041--72062},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/ye2025icml-ab/}
}