Beyond the Boundaries of SMOTE - A Framework for Manifold-Based Synthetically Oversampling

Abstract

Problems of class imbalance appear in diverse domains, ranging from gene function annotation to spectra and medical classification. On such problems, the classifier becomes biased in favour of the majority class, which leads to inaccuracy on the important minority classes, such as specific diseases and gene functions. Synthetic oversampling mitigates this by balancing the training set whilst avoiding the pitfalls of random under- and oversampling. The existing methods are primarily based on the SMOTE algorithm, whose generative bias is to create points at random positions between nearest neighbours. The relationship between the generative bias and the latent distribution has a significant impact on the performance of the induced classifier. Our research into gamma-ray spectra classification has shown that the generative bias applied by SMOTE is inappropriate for domains that conform to the manifold property, such as spectra, text, image and climate change classification. To this end, we propose a framework for manifold-based synthetic oversampling, and demonstrate its superior robustness to the manifold, as measured by the AUC, on three spectra classification tasks and 16 UCI datasets.
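The generative bias attributed to SMOTE above, placing new points at random positions between nearest neighbours, can be illustrated with a minimal sketch. This is not the authors' code and not the manifold-based framework the paper proposes. The function name `smote_like_samples`, its parameters, and the unit-circle toy data are assumptions made only for illustration.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Hypothetical SMOTE-style sampler: interpolate between minority-class neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority points")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise squared Euclidean distances; a point is never its own neighbour.
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)
    nn_idx = np.argsort(d2, axis=1)[:, :k]  # k nearest neighbours per point

    # Pick a random base point, one of its neighbours, and a gap in [0, 1).
    base = rng.integers(0, n, size=n_new)
    nbr = nn_idx[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))

    # Each synthetic point lies on the line segment from base to neighbour.
    return X_min[base] + gap * (X_min[nbr] - X_min[base])


# Toy minority class lying on a one-dimensional manifold: the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

synthetic = smote_like_samples(circle, n_new=50, k=5)
radii = np.linalg.norm(synthetic, axis=1)
```

In this toy setting every original point has radius 1, while each synthetic point lies on a chord between two of them and so has radius at most 1. Straight-line interpolation therefore tends to place samples off a curved manifold, which is the kind of mismatch between generative bias and latent distribution that the abstract describes.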

Cite

Text

Bellinger et al. "Beyond the Boundaries of SMOTE - A Framework for Manifold-Based Synthetically Oversampling." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016. doi:10.1007/978-3-319-46128-1_16

Markdown

[Bellinger et al. "Beyond the Boundaries of SMOTE - A Framework for Manifold-Based Synthetically Oversampling." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.](https://mlanthology.org/ecmlpkdd/2016/bellinger2016ecmlpkdd-beyond/) doi:10.1007/978-3-319-46128-1_16

BibTeX

@inproceedings{bellinger2016ecmlpkdd-beyond,
  title     = {{Beyond the Boundaries of SMOTE - A Framework for Manifold-Based Synthetically Oversampling}},
  author    = {Bellinger, Colin and Drummond, Christopher and Japkowicz, Nathalie},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2016},
  pages     = {248--263},
  doi       = {10.1007/978-3-319-46128-1_16},
  url       = {https://mlanthology.org/ecmlpkdd/2016/bellinger2016ecmlpkdd-beyond/}
}