Compressing Tabular Data via Latent Variable Estimation
Abstract
Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) estimate latent variables associated with rows and columns; (ii) partition the table into blocks according to the row/column latents; (iii) apply a sequential (e.g., Lempel-Ziv) coder to each of the blocks; (iv) append a compressed encoding of the latents. We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data in which latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well-defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes, such as Lempel-Ziv and finite-state encoders, do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
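The four-step pipeline from the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: a simple k-means-style clustering of empirical symbol frequencies stands in for the paper's latent estimator, and zlib stands in for the per-block sequential (Lempel-Ziv) coder. All function names and parameters below are illustrative assumptions.

```python
import zlib
import numpy as np

def estimate_latents(table, k, axis, seed=0):
    """Step (i): cluster rows (axis=0) or columns (axis=1) of a categorical
    table into k groups, using their empirical symbol-frequency profiles.
    A few Lloyd (k-means) iterations serve as a stand-in latent estimator."""
    profiles = table if axis == 0 else table.T
    n_symbols = int(table.max()) + 1
    freqs = np.stack([np.bincount(p, minlength=n_symbols) for p in profiles])
    freqs = freqs / freqs.sum(axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centers = freqs[rng.choice(len(freqs), size=k, replace=False)]
    for _ in range(20):
        labels = np.argmin(((freqs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = freqs[labels == j].mean(axis=0)
    return labels

def compress_table(table, k_rows=2, k_cols=2):
    """Steps (ii)-(iv): partition the table into blocks by the row/column
    latents, code each block sequentially (zlib here), and append a
    compressed encoding of the latents themselves. Entries are assumed
    to be small nonnegative integers (< 256)."""
    row_lat = estimate_latents(table, k_rows, axis=0)
    col_lat = estimate_latents(table, k_cols, axis=1)
    blocks = []
    for r in range(k_rows):
        for c in range(k_cols):
            block = table[np.ix_(row_lat == r, col_lat == c)]
            blocks.append(zlib.compress(block.astype(np.uint8).tobytes()))
    latents = zlib.compress(
        np.concatenate([row_lat, col_lat]).astype(np.uint8).tobytes())
    return blocks, latents
```

The point of the partition is that each block is (approximately) i.i.d. given the latents, so a sequential coder applied per block can approach the conditional entropy, while the appended latent encoding costs only a lower-order number of bits.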
Cite
Text
Montanari and Weiner. "Compressing Tabular Data via Latent Variable Estimation." International Conference on Machine Learning, 2023.

Markdown

[Montanari and Weiner. "Compressing Tabular Data via Latent Variable Estimation." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/montanari2023icml-compressing/)

BibTeX
@inproceedings{montanari2023icml-compressing,
title = {{Compressing Tabular Data via Latent Variable Estimation}},
author = {Montanari, Andrea and Weiner, Eric},
booktitle = {International Conference on Machine Learning},
year = {2023},
pages = {25174--25208},
volume = {202},
url = {https://mlanthology.org/icml/2023/montanari2023icml-compressing/}
}