Language Models Are Good Tabular Learners
Abstract
Transformer-based language models have become the de facto standard in natural language processing. However, they underperform on tabular data compared to traditional tree-based methods. We posit that current approaches fail to realize the full potential of language models due to (i) the heterogeneity of tabular data; and (ii) the difficulty language models have in interpreting numerical values. Based on this hypothesis, we propose the Tabular Domain Transformer (TDTransformer) framework. TDTransformer uses distinct embedding processes for different column types, and alignment layers transform these type-specific embeddings into a common space. In addition, TDTransformer adopts piece-wise linear encoding of numerical values for better performance. We test the proposed method on 76 real-world tabular classification datasets from the OpenML benchmark. Extensive experiments indicate that TDTransformer significantly improves over state-of-the-art methods.
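To make the numerical-encoding idea concrete, below is a minimal sketch of piece-wise linear encoding (PLE) for a single numerical column, assuming quantile-derived bin boundaries from a training split. The function name `piecewise_linear_encode` and the toy bin edges are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def piecewise_linear_encode(x, bin_edges):
    """Piece-wise linear encoding of scalar feature values.

    x:         array of shape (n,) with raw numerical values
    bin_edges: sorted array of shape (T + 1,) of bin boundaries,
               e.g. quantiles of the training data (an assumption here)

    Returns an array of shape (n, T): 1 for bins the value has fully
    passed, 0 for bins it has not reached, and a linear fraction for
    the bin that contains it.
    """
    lower, upper = bin_edges[:-1], bin_edges[1:]       # each of shape (T,)
    frac = (x[:, None] - lower) / (upper - lower)      # shape (n, T)
    return np.clip(frac, 0.0, 1.0)

# Toy usage with hypothetical bin edges.
values = np.array([0.2, 1.5, 3.7])
edges = np.array([0.0, 1.0, 2.0, 4.0])
print(piecewise_linear_encode(values, edges))
# [[0.2  0.   0.  ]
#  [1.   0.5  0.  ]
#  [1.   1.   0.85]]
```

The resulting per-column vectors would then be projected and aligned with the embeddings of categorical and text columns; that alignment step is not shown here.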
Cite
Text
Huang et al. "Language Models Are Good Tabular Learners." Transactions on Machine Learning Research, 2025.
Markdown
[Huang et al. "Language Models Are Good Tabular Learners." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/huang2025tmlr-language/)
BibTeX
@article{huang2025tmlr-language,
  title = {{Language Models Are Good Tabular Learners}},
  author = {Huang, Zhenhan and Srinivas, Kavitha and Samulowitz, Horst and D'Souza, Niharika S. and Aggarwal, Charu C. and Chen, Pin-Yu and Gao, Jianxi},
  journal = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://mlanthology.org/tmlr/2025/huang2025tmlr-language/}
}