Self Supervised Pre-Training for Large Scale Tabular Data
Abstract
In this paper, we tackle the problem of self-supervised pre-training of deep neural networks for large-scale tabular data in online advertising. Self-supervised learning has recently proven highly effective for pre-training representations in domains such as vision and natural language processing. Unlike these domains, however, designing self-supervised learning tasks for tabular data is inherently challenging: tabular data can mix feature types with high cardinality and wide ranges of feature values, especially in large-scale real-world settings. To address this, we propose a self-supervised pre-training strategy that uses Manifold Mixup to produce data augmentations for tabular data and reconstructs the original inputs from these augmentations using noise contrastive estimation and mean absolute error losses, both of which are particularly suitable for large-scale tabular data. We demonstrate its efficacy on click fraud detection for ads, obtaining a 9% relative improvement in robot detection metrics over a supervised learning baseline and 4% over a contrastive learning baseline.
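To make the idea concrete, below is a minimal PyTorch sketch of Manifold-Mixup-based reconstruction pre-training for a tabular encoder, under stated assumptions: the class name `TabularMixupPretrainer`, the layer sizes, and the use of cross-entropy as a stand-in for the paper's noise contrastive estimation loss are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TabularMixupPretrainer(nn.Module):
    """Illustrative sketch (not the authors' code): embed tabular fields,
    mix hidden states (Manifold Mixup), and reconstruct the inputs."""

    def __init__(self, cat_cardinalities, num_numeric, d=32, alpha=2.0):
        super().__init__()
        # One embedding table per categorical field, a projection for numerics.
        self.embeds = nn.ModuleList(nn.Embedding(c, d) for c in cat_cardinalities)
        self.num_proj = nn.Linear(num_numeric, d)
        in_dim = d * (len(cat_cardinalities) + 1)
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 256))
        # Decoder heads: logits over each categorical vocabulary, plus numerics.
        self.cat_heads = nn.ModuleList(nn.Linear(256, c) for c in cat_cardinalities)
        self.num_head = nn.Linear(256, num_numeric)
        self.alpha = alpha

    def forward(self, cats, nums):
        # cats: (B, n_cat) int64 field ids; nums: (B, num_numeric) floats.
        parts = [emb(cats[:, i]) for i, emb in enumerate(self.embeds)]
        parts.append(self.num_proj(nums))
        h = self.encoder(torch.cat(parts, dim=-1))

        # Manifold Mixup: interpolate hidden states of randomly paired rows.
        lam = torch.distributions.Beta(self.alpha, self.alpha).sample()
        perm = torch.randperm(h.size(0), device=h.device)
        h_mix = lam * h + (1 - lam) * h[perm]

        # Reconstruction: cross-entropy over each categorical field
        # (a stand-in for the paper's NCE loss), MAE for numeric fields.
        cat_loss = sum(
            lam * F.cross_entropy(head(h_mix), cats[:, i])
            + (1 - lam) * F.cross_entropy(head(h_mix), cats[perm, i])
            for i, head in enumerate(self.cat_heads))
        num_target = lam * nums + (1 - lam) * nums[perm]
        num_loss = F.l1_loss(self.num_head(h_mix), num_target)
        return cat_loss + num_loss
```

With `lam` drawn from Beta(α, α) per batch, the mixed hidden state interpolates two rows and the loss asks the decoder to recover both originals in proportion; after pre-training, the encoder would be fine-tuned on downstream labels such as click fraud.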
Cite
Chitlangia et al. "Self Supervised Pre-Training for Large Scale Tabular Data." NeurIPS 2022 Workshops: TRL, 2022.
@inproceedings{chitlangia2022neuripsw-self,
  title = {{Self Supervised Pre-Training for Large Scale Tabular Data}},
  author = {Chitlangia, Sharad and Muralidhar, Anand and Agarwal, Rajat},
  booktitle = {NeurIPS 2022 Workshops: TRL},
  year = {2022},
  url = {https://mlanthology.org/neuripsw/2022/chitlangia2022neuripsw-self/}
}