PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data

Abstract

While social media has taken a fixed place in our daily life, its steadily growing prominence also exacerbates the problem of hostile contents and hate-speech. These destructive phenomena call for automatic hate-speech detection, which, however, is facing two major challenges, namely i) the dynamic nature of online content causing significant data-drift over time, and ii) a high class-skew, as hate-speech represents a relatively small fraction of the overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain-experts to annotate meticulously selected batches of data instances. However, little prior work exists on batch mode active learning with high class-skew, and in particular for the problem of hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework falls into the so-called screening approach, which pre-selects a subset of most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven skew-specialized cluster representation, with a higher potential to “cherry pick” minority classes. In extensive experiments we demonstrate substantial improvements in terms of G-Means, and F1 measure, over several baseline approaches and multiple datasets, for highly imbalanced class ratios.

Cite

Text

Fajri et al. "PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020. doi:10.1007/978-3-030-67670-4_5

Markdown

[Fajri et al. "PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020.](https://mlanthology.org/ecmlpkdd/2020/fajri2020ecmlpkdd-ps3/) doi:10.1007/978-3-030-67670-4_5

BibTeX

@inproceedings{fajri2020ecmlpkdd-ps3,
  title     = {{PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data}},
  author    = {Fajri, Ricky Maulana and Khoshrou, Samaneh and Peharz, Robert and Pechenizkiy, Mykola},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2020},
  pages     = {68-84},
  doi       = {10.1007/978-3-030-67670-4_5},
  url       = {https://mlanthology.org/ecmlpkdd/2020/fajri2020ecmlpkdd-ps3/}
}