Clustering in the Presence of Concept Drift

Abstract

Clustering naturally addresses many of the challenges of data streams, and many data stream clustering algorithms (DSCAs) have been proposed. The literature does not, however, provide quantitative descriptions of how these algorithms behave in different circumstances. In this paper we study how the clusterings produced by different DSCAs change, relative to the ground truth, as quantitatively different types of concept drift are encountered. This paper makes two contributions to the literature. First, we propose a method for generating real-valued data streams with precise quantitative concept drift. Second, we conduct an experimental study that provides quantitative analyses of DSCA performance on synthetic real-valued data streams, and we show how to apply this knowledge to real-world data streams. We find that large-magnitude, short-duration concept drifts are the most challenging and that DSCAs with partitioning-based offline clustering methods are generally more robust than those with density-based offline clustering methods. Our results further indicate that increasing the number of classes present in a stream creates a more challenging environment than decreasing it. Code related to this paper is available at: https://doi.org/10.5281/zenodo.1168699, https://doi.org/10.5281/zenodo.1216189, https://doi.org/10.5281/zenodo.1213802, https://doi.org/10.5281/zenodo.1304380.
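
To make the notions of drift magnitude and duration concrete, the following is a minimal sketch, not the generator proposed in the paper. It assumes a single Gaussian cluster whose mean moves linearly from one position to another, takes magnitude to be the Euclidean distance between the two means, and takes duration to be the number of instances the transition spans. The function names (`drift_fraction`, `drift_magnitude`, `generate_stream`) and all parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def drift_fraction(t, drift_start, duration):
    """Fraction of the drift completed at time step t, clipped to [0, 1]."""
    if duration <= 0:
        # Treat a non-positive duration as an abrupt change (assumption).
        return 1.0 if t >= drift_start else 0.0
    return min(max((t - drift_start) / duration, 0.0), 1.0)


def drift_magnitude(start_mean, end_mean):
    """Euclidean distance between the two means (an illustrative measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(start_mean, end_mean)))


def generate_stream(n, start_mean, end_mean, drift_start, duration,
                    sigma=1.0, seed=0):
    """Yield n real-valued points from a Gaussian whose mean drifts linearly."""
    rng = random.Random(seed)
    for t in range(n):
        frac = drift_fraction(t, drift_start, duration)
        # Interpolate each coordinate of the mean between start and end.
        mean = [a + frac * (b - a) for a, b in zip(start_mean, end_mean)]
        yield [rng.gauss(m, sigma) for m in mean]


if __name__ == "__main__":
    stream = list(generate_stream(n=100, start_mean=[0.0, 0.0],
                                  end_mean=[3.0, 4.0],
                                  drift_start=40, duration=20, sigma=0.5))
    print(len(stream), drift_magnitude([0.0, 0.0], [3.0, 4.0]))
```

Under these assumptions, the large-magnitude, short-duration setting reported as most challenging corresponds to a large distance between `start_mean` and `end_mean` combined with a small `duration`.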

Cite

Text

Moulton et al. "Clustering in the Presence of Concept Drift." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2018. doi:10.1007/978-3-030-10925-7_21

Markdown

[Moulton et al. "Clustering in the Presence of Concept Drift." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2018.](https://mlanthology.org/ecmlpkdd/2018/moulton2018ecmlpkdd-clustering/) doi:10.1007/978-3-030-10925-7_21

BibTeX

@inproceedings{moulton2018ecmlpkdd-clustering,
  title     = {{Clustering in the Presence of Concept Drift}},
  author    = {Moulton, Richard Hugh and Viktor, Herna L. and Japkowicz, Nathalie and Gama, João},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2018},
  pages     = {339--355},
  doi       = {10.1007/978-3-030-10925-7_21},
  url       = {https://mlanthology.org/ecmlpkdd/2018/moulton2018ecmlpkdd-clustering/}
}