Text Bundling: Statistics Based Data-Reduction
Abstract
As text corpora grow larger, the tradeoff between speed and accuracy becomes critical: slow but accurate methods may not complete in a practical amount of time. To bring the training data down to a manageable size, a data-reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternative method that reduces the number of training points by combining them so that important statistical information is retained. Our algorithm preserves the same statistics that fast, linear-time text algorithms such as Rocchio and Naive Bayes use. We provide empirical results showing that our data-reduction technique compares favorably to three other data-reduction techniques on four standard text corpora.

ICML: Proceedings of the Twentieth International Conference on Machine Learning
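The core idea of bundling can be illustrated with a small sketch. This is not the paper's exact algorithm; it assumes one simple instantiation in which each class's documents are summed in random groups of k term-count vectors. Group sums leave the per-class term totals unchanged, which are exactly the sufficient statistics that centroid-style (Rocchio) and Naive Bayes classifiers estimate from, while shrinking the training set by roughly a factor of k. The function name `bundle` and the grouping scheme are illustrative choices, not from the paper.

```python
import numpy as np

def bundle(X, k, rng=None):
    """Reduce an (n, d) matrix of document term vectors for one class
    to ceil(n / k) "bundle" vectors by summing random groups of k rows.

    Summing preserves the class-level term sums, so statistics derived
    from them (class centroids, Naive Bayes word counts) are retained.
    """
    rng = np.random.default_rng(rng)
    idx = rng.permutation(X.shape[0])               # random grouping
    n_groups = -(-X.shape[0] // k)                  # ceil division
    groups = np.array_split(idx, n_groups)
    return np.vstack([X[g].sum(axis=0) for g in groups])

# Toy example: 6 documents over a 3-word vocabulary, bundled in threes.
X = np.array([[1, 0, 2],
              [0, 1, 1],
              [2, 2, 0],
              [1, 1, 1],
              [0, 0, 3],
              [3, 1, 0]], dtype=float)
B = bundle(X, k=3, rng=0)
assert B.shape == (2, 3)
# The class-level term sums are unchanged by bundling.
assert np.allclose(B.sum(axis=0), X.sum(axis=0))
```

A classifier that only consumes per-class sums or means (e.g. Rocchio's centroid) produces the same model from `B` as from `X`, but trains on a third of the points.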
Cite
Text
Shih et al. "Text Bundling: Statistics Based Data-Reduction." International Conference on Machine Learning, 2003.

Markdown

[Shih et al. "Text Bundling: Statistics Based Data-Reduction." International Conference on Machine Learning, 2003.](https://mlanthology.org/icml/2003/shih2003icml-text/)

BibTeX
@inproceedings{shih2003icml-text,
title = {{Text Bundling: Statistics Based Data-Reduction}},
author = {Shih, Lawrence and Rennie, Jason D. M. and Chang, Yu-Han and Karger, David R.},
booktitle = {International Conference on Machine Learning},
year = {2003},
pages = {696-703},
url = {https://mlanthology.org/icml/2003/shih2003icml-text/}
}