Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive
Abstract
It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
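As an illustration of the two labeling strategies the abstract compares, here is a minimal sketch, not the authors' implementation. It assumes a linear classifier through the origin with weight vector `w` (either a current estimate or an oracle boundary) and uses `|<w, x>|` as an unnormalized distance to the decision boundary. The function names and the toy pool are hypothetical.

```python
import random


def margin(w, x):
    """Unnormalized distance |<w, x>| of point x to the boundary {x : <w, x> = 0}."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)))


def select_margin_based(w, pool, k):
    """Margin-based sampling: indices of the k pool points closest to the boundary."""
    order = sorted(range(len(pool)), key=lambda i: margin(w, pool[i]))
    return order[:k]


def select_passive(pool, k, seed=0):
    """Passive sampling: k indices drawn uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(len(pool)), k)


# Toy unlabeled pool in 2D (hypothetical data) and a boundary along the second axis.
pool = [(2.0, 0.5), (0.1, 1.0), (-3.0, 0.2), (-0.2, -1.0), (0.05, 3.0)]
w = (1.0, 0.0)

margin_idx = select_margin_based(w, pool, k=2)   # points nearest the boundary
passive_idx = select_passive(pool, k=2)          # uniformly random points
```

In both cases the selected indices would then be sent for labeling and the classifier refit. The only difference is how the points to label are chosen.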
Cite
Text
Tifrea et al. "Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive." International Conference on Machine Learning, 2023.
Markdown
[Tifrea et al. "Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/tifrea2023icml-marginbased/)
BibTeX
@inproceedings{tifrea2023icml-marginbased,
title = {{Margin-Based Sampling in High Dimensions: When Being Active Is Less Efficient than Staying Passive}},
author = {Tifrea, Alexandru and Clarysse, Jacob and Yang, Fanny},
booktitle = {International Conference on Machine Learning},
year = {2023},
pages = {34222--34262},
volume = {202},
url = {https://mlanthology.org/icml/2023/tifrea2023icml-marginbased/}
}