LLM-Guided Self-Supervised Tabular Learning with Task-Specific Pre-Text Tasks

Abstract

One of the most common approaches to self-supervised representation learning is defining pre-text tasks from which data representations are learned. Existing works determine pre-text tasks in a "task-agnostic" way, without considering the forthcoming downstream tasks. This offers the advantage of broad applicability across tasks, but it can also create a mismatch between the pre-text and downstream objectives, potentially degrading performance on the downstream task. In this paper, we introduce TST-LLM, a framework that effectively reduces this mismatch when a natural-language description of the downstream task is available, without any ground-truth labels. TST-LLM instructs the LLM to use the downstream task's description and the dataset's meta-information to discover features relevant to the target task. These discovered features are then treated as ground-truth labels to define "target-specific" pre-text tasks. Applied to 22 benchmark tabular datasets covering binary classification, multi-class classification, and regression tasks, TST-LLM consistently outperforms contemporary baselines such as STUNT and LFR, with win ratios of 95% and 81%, respectively.
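The abstract's recipe can be illustrated with a minimal sketch: prompt an LLM with the downstream task description and dataset meta-information, treat the features it proposes as pseudo-labels, and pre-train an encoder on the resulting "target-specific" pre-text task. The sketch below is not the authors' code; the prompt format, the stubbed query_llm response, and all helper names are illustrative assumptions, and only the overall flow follows the abstract.

# Minimal sketch of the TST-LLM idea described in the abstract (not the authors' code).
# The prompt format, the stubbed LLM call, and all names below are assumptions.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

def build_prompt(task_description: str, df: pd.DataFrame) -> str:
    """Combine the downstream task description with dataset meta-information."""
    cols = ", ".join(f"{c} ({df[c].dtype})" for c in df.columns)
    return (f"Task: {task_description}\n"
            f"Columns: {cols}\n"
            "Propose new features (as pandas expressions over existing columns) "
            "that are likely predictive of the target.")

def query_llm(prompt: str) -> list[str]:
    """Stub standing in for an LLM call; returns hypothetical feature expressions."""
    return ["age * bmi", "income / (1 + dependents)"]

def make_pseudo_labels(df: pd.DataFrame, expressions: list[str]) -> np.ndarray:
    """Evaluate the LLM-proposed features; they serve as pre-text-task targets."""
    feats = [df.eval(e).to_numpy(dtype=np.float32) for e in expressions]
    z = np.stack(feats, axis=1)
    return (z - z.mean(0)) / (z.std(0) + 1e-8)

class PretextModel(nn.Module):
    """Shared encoder plus one regression head per LLM-discovered feature."""
    def __init__(self, d_in: int, d_hidden: int, n_pretext: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.heads = nn.Linear(d_hidden, n_pretext)

    def forward(self, x):
        return self.heads(self.encoder(x))

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 512),
                   "bmi": rng.uniform(18, 35, 512),
                   "income": rng.uniform(2e4, 1e5, 512),
                   "dependents": rng.integers(0, 4, 512)})
prompt = build_prompt("Predict 10-year diabetes risk (binary).", df)
pseudo_y = make_pseudo_labels(df, query_llm(prompt))

x = torch.tensor(df.to_numpy(dtype=np.float32))
x = (x - x.mean(0)) / (x.std(0) + 1e-8)
y = torch.tensor(pseudo_y)

model = PretextModel(d_in=x.shape[1], d_hidden=64, n_pretext=y.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):  # pre-train on the target-specific pre-text task
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
# model.encoder now produces representations shaped by features the LLM
# judged relevant to the described downstream task.

After pre-training, the encoder would be fine-tuned or probed on the actual downstream labels; how the paper filters or weights the LLM-proposed features is not covered by this sketch.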

Cite

Text

Han et al. "LLM-Guided Self-Supervised Tabular Learning with Task-Specific Pre-Text Tasks." Transactions on Machine Learning Research, 2025.

Markdown

[Han et al. "LLM-Guided Self-Supervised Tabular Learning with Task-Specific Pre-Text Tasks." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/han2025tmlr-llmguided/)

BibTeX

@article{han2025tmlr-llmguided,
  title     = {{LLM-Guided Self-Supervised Tabular Learning with Task-Specific Pre-Text Tasks}},
  author    = {Han, Sungwon and Lee, Seungeon and Cha, Meeyoung and Arik, Sercan O and Yoon, Jinsung},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/han2025tmlr-llmguided/}
}