Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive

Abstract

It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
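The abstract contrasts margin-based active learning (repeatedly labeling the pool point closest to the current decision boundary) with passive random sampling under the same labeling budget. As a minimal illustrative sketch — not the paper's experimental setup — the following numpy code implements both strategies for logistic regression on an invented noiseless synthetic task; all dataset parameters, helper names, and hyperparameters here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic pool: noiseless labels given by the first coordinate,
# with the two classes shifted apart along that coordinate.
d, n_pool = 50, 2000
X = rng.normal(size=(n_pool, d))
y = (X[:, 0] > 0).astype(int)              # noiseless labels
X[:, 0] += np.where(y == 1, 1.0, -1.0)     # class separation

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (no intercept)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def margin_al(X, y, budget, seed_size=10):
    """Margin-based AL: query the unlabeled point with smallest |x . w|."""
    labeled = list(range(seed_size))       # small random seed set
    while len(labeled) < budget:
        w = fit_logreg(X[labeled], y[labeled])
        margins = np.abs(X @ w)
        margins[labeled] = np.inf          # never re-query a labeled point
        labeled.append(int(np.argmin(margins)))
    return fit_logreg(X[labeled], y[labeled])

def passive(X, y, budget):
    """Passive learning: label a uniformly random subset of the pool."""
    idx = rng.choice(len(X), size=budget, replace=False)
    return fit_logreg(X[idx], y[idx])

budget = 100
for name, w in [("margin AL", margin_al(X, y, budget)),
                ("passive ", passive(X, y, budget))]:
    acc = np.mean((X @ w > 0).astype(int) == y)
    print(f"{name}: pool accuracy {acc:.3f}")
```

In high dimensions with small class separation, the paper's analysis predicts that the margin-based queries (points near the boundary) can yield a worse classifier than the random subset, despite the noiseless labels.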

Cite

Text

Tifrea et al. "Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive." International Conference on Machine Learning, 2023.

Markdown

[Tifrea et al. "Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/tifrea2023icml-marginbased/)

BibTeX

@inproceedings{tifrea2023icml-marginbased,
  title     = {{Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive}},
  author    = {Tifrea, Alexandru and Clarysse, Jacob and Yang, Fanny},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {34222--34262},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/tifrea2023icml-marginbased/}
}