Less Is More: Active Learning with Support Vector Machines

Abstract

We describe a simple active learning heuristic which greatly enhances the generalization behavior of support vector machines (SVMs) on several practical document classification tasks. We observe a number of benefits, the most surprising of which is that an SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data. The heuristic for choosing this subset is simple to compute, and makes no use of information about the test set. Given that the training time of SVMs depends heavily on the training set size, our heuristic not only offers better performance with fewer data, it frequently does so in less time than the naive approach of training on all available data.

1. Introduction

There are many uses for a good document classifier: sorting mail into mailboxes, filtering spam or routing news articles. The problem is that learning to classify documents requires manually labelling more documents than a typical...
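The abstract above does not spell out the selection heuristic, so the following is only a rough sketch of what an active learning loop around a linear SVM might look like. It assumes a margin-based rule (query the unlabelled example closest to the current decision boundary), which is a common choice for SVM active learning, and it swaps in a small Pegasos-style subgradient trainer so the example needs only the standard library. The names `train_linear_svm`, `select_query`, `active_learn`, and `oracle` are hypothetical and are not taken from the paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent for a linear SVM; labels are +1/-1.

    Assumption: a stand-in trainer for illustration, not the solver used in the paper.
    """
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = y[i] * (dot(w, X[i]) + b)         # computed before the update
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularization)
            if margin < 1.0:                           # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def select_query(w, b, pool_X, candidates):
    """Assumed heuristic: pick the candidate with the smallest |w.x + b|."""
    return min(candidates, key=lambda i: abs(dot(w, pool_X[i]) + b))


def active_learn(pool_X, oracle, seed_idx, n_queries):
    """Grow the labelled set one query at a time; oracle(i) returns the label of example i."""
    labels = {i: oracle(i) for i in seed_idx}
    chosen = list(seed_idx)
    for _ in range(n_queries):
        w, b = train_linear_svm([pool_X[i] for i in chosen], [labels[i] for i in chosen])
        candidates = [i for i in range(len(pool_X)) if i not in labels]
        if not candidates:
            break
        q = select_query(w, b, pool_X, candidates)
        labels[q] = oracle(q)          # only now is a label requested
        chosen.append(q)
    w, b = train_linear_svm([pool_X[i] for i in chosen], [labels[i] for i in chosen])
    return w, b, chosen


# Toy usage: two Gaussian blobs stand in for document feature vectors.
rng = random.Random(0)
pool = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (-2.0, 2.0) for _ in range(20)]
truth = [-1] * 20 + [1] * 20
w, b, chosen = active_learn(pool, lambda i: truth[i], seed_idx=[0, 20], n_queries=10)
print("labelled", len(chosen), "of", len(pool), "examples")
```

Note that the selection step looks only at the unlabelled training pool, in keeping with the abstract's remark that the heuristic makes no use of information about the test set. A real document classifier would use sparse text features and a proper SVM solver in place of the toy trainer.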

Cite

Text

Schohn and Cohn. "Less Is More: Active Learning with Support Vector Machines." International Conference on Machine Learning, 2000.

Markdown

[Schohn and Cohn. "Less Is More: Active Learning with Support Vector Machines." International Conference on Machine Learning, 2000.](https://mlanthology.org/icml/2000/schohn2000icml-less/)

BibTeX

@inproceedings{schohn2000icml-less,
  title     = {{Less Is More: Active Learning with Support Vector Machines}},
  author    = {Schohn, Greg and Cohn, David},
  booktitle = {International Conference on Machine Learning},
  year      = {2000},
  pages     = {839-846},
  url       = {https://mlanthology.org/icml/2000/schohn2000icml-less/}
}