Correlation Clustering

Abstract

We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge ( u , v ) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k -median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n , depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Veg (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.

Cite

Text

Bansal et al. "Correlation Clustering." Machine Learning, 2004. doi:10.1023/B:MACH.0000033116.57574.95

Markdown

[Bansal et al. "Correlation Clustering." Machine Learning, 2004.](https://mlanthology.org/mlj/2004/bansal2004mlj-correlation/) doi:10.1023/B:MACH.0000033116.57574.95

BibTeX

@article{bansal2004mlj-correlation,
  title     = {{Correlation Clustering}},
  author    = {Bansal, Nikhil and Blum, Avrim and Chawla, Shuchi},
  journal   = {Machine Learning},
  year      = {2004},
  pages     = {89-113},
  doi       = {10.1023/B:MACH.0000033116.57574.95},
  volume    = {56},
  url       = {https://mlanthology.org/mlj/2004/bansal2004mlj-correlation/}
}