Modeling String Entries for Tabular Data Prediction: Do We Need Big Large Language Models?
Abstract
Tabular data are often characterized by numerical and categorical features, but these co-exist with features made of text entries, such as names or descriptions. Here, we investigate whether language models can extract information from these text entries. Studying 19 datasets and varying training sizes, we find that using a language model to encode text features improves predictions compared to using no encoding at all and to character-level approaches based on substrings. Furthermore, we find that larger, more advanced language models translate to larger improvements.
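As a rough illustration of the setup the abstract describes, the sketch below encodes a text column with a small pre-trained language model and feeds the resulting embeddings, alongside a numerical feature, to a gradient-boosted tree. The encoder name, the toy data, and the downstream model are illustrative assumptions, not the paper's exact pipeline.

# Minimal sketch, assuming sentence-transformers and scikit-learn are
# installed. Encode string entries with a pre-trained language model,
# then concatenate the embeddings with numerical features for a
# standard tabular learner. Model choice and toy data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy table: one text feature (job title) and one numerical feature.
texts = ["senior data engineer", "nurse practitioner",
         "software developer", "registered nurse"] * 50
numeric = np.random.default_rng(0).normal(size=(len(texts), 1))
y = np.array([1, 0, 1, 0] * 50)  # illustrative binary target

# Encode the string entries with a (small) pre-trained language model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_emb = encoder.encode(texts)  # shape: (n_rows, embedding_dim)

# Concatenate with the numerical column and fit a tree-based model.
X = np.hstack([text_emb, numeric])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = HistGradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))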
Cite
Text
Grinsztajn et al. "Modeling String Entries for Tabular Data Prediction: Do We Need Big Large Language Models?" NeurIPS 2023 Workshops: TRL, 2023.
Markdown
[Grinsztajn et al. "Modeling String Entries for Tabular Data Prediction: Do We Need Big Large Language Models?" NeurIPS 2023 Workshops: TRL, 2023.](https://mlanthology.org/neuripsw/2023/grinsztajn2023neuripsw-modeling/)
BibTeX
@inproceedings{grinsztajn2023neuripsw-modeling,
  title = {{Modeling String Entries for Tabular Data Prediction: Do We Need Big Large Language Models?}},
  author = {Grinsztajn, Leo and Kim, Myung Jun and Oyallon, Edouard and Varoquaux, Gael},
  booktitle = {NeurIPS 2023 Workshops: TRL},
  year = {2023},
  url = {https://mlanthology.org/neuripsw/2023/grinsztajn2023neuripsw-modeling/}
}