Learning Mixtures of Product Distributions Using Correlations and Independence

Abstract

We study the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions $D = D_1, \ldots, D_T$ and mixing weights $w_1, \ldots, w_T$ such that $\sum_i w_i = 1$. A sample from a mixture is generated by choosing $i$ with probability $w_i$ and then drawing a sample from distribution $D_i$. The problem of learning the mixture is that of finding the parameters of the distributions comprising $D$, given only the ability to sample from the mixture. In this paper, we restrict ourselves to learning mixtures of product distributions. The key to learning the mixtures is to find a few vectors such that points from different distributions are sharply separated upon projection onto these vectors. Previous techniques use the vectors corresponding to the top few directions of highest variance of the mixture. Unfortunately, these may be directions of high noise rather than directions along which the distributions are separated. Further, skewed mixing weights amplify the effects of noise, and as a result, previous techniques work only when the separation between the input distributions is large relative to the imbalance in the mixing weights. In this paper, we give an algorithm that successfully learns mixtures of distributions under a separation condition that depends only logarithmically on the skewed mixing weights. In particular, it succeeds for a separation between the centers of $\Theta(\sigma\sqrt{T}\log\Lambda)$, where $\sigma$ is the maximum directional standard deviation of any distribution in the mixture, $T$ is the number of distributions, and $\Lambda$ is polynomial in $T$, $\sigma$, $\log n$, and the imbalance in the mixing weights. For our algorithm to succeed, we require a spreading condition: that the distance between the centers be spread across $\Theta(T \log \Lambda)$ coordinates.
Additionally, with arbitrarily small separation (i.e., even when the separation is not sufficient for clustering), given enough samples, we can approximate the subspace containing the centers. Previous techniques failed to do so in polynomial time for non-spherical distributions, regardless of the number of samples, unless the separation was large with respect to the maximum directional standard deviation $\sigma$ and polynomially large with respect to the imbalance of the mixing weights. Our algorithm works for binary product distributions and axis-aligned Gaussians. The spreading condition above is implied by the separation condition for binary product distributions, and it is necessary for algorithms that rely on linear correlations. Finally, when a stronger version of our spreading condition holds, our algorithm performs successful clustering when the separation between the centers is only $\Theta(\sigma^*\sqrt{T}\log\Lambda)$, where $\sigma^*$ is the maximum directional standard deviation in the subspace containing the centers of the distributions.
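The generative process described in the abstract (choose component $i$ with probability $w_i$, then draw from $D_i$) can be sketched for the axis-aligned Gaussian case as follows. This is an illustrative sketch only, not the paper's learning algorithm; the function name `sample_mixture` and its parameterization by per-component means and coordinate-wise standard deviations are our own choices for illustration.

```python
import numpy as np

def sample_mixture(means, sigmas, weights, rng, n_samples):
    """Draw samples from a mixture of axis-aligned Gaussians.

    means:   (T, n) array, row t is the center of component t
    sigmas:  (T, n) array, row t holds coordinate-wise std. deviations
             (axis-aligned: the covariance of each component is diagonal)
    weights: (T,) array of mixing weights summing to 1
    Returns (samples, labels): an (n_samples, n) array and the
    component index chosen for each sample.
    """
    # Step 1: pick a component index i with probability w_i.
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: draw from the chosen product distribution D_i;
    # each coordinate is sampled independently (product structure).
    samples = rng.normal(loc=means[labels], scale=sigmas[labels])
    return samples, labels
```

With skewed weights (e.g., `weights = [0.01, 0.99]`), the rare component contributes few samples, which is exactly the imbalance that amplifies noise in variance-based methods as discussed above.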

Cite

Text

Chaudhuri and Rao. "Learning Mixtures of Product Distributions Using Correlations and Independence." Annual Conference on Computational Learning Theory, 2008.

Markdown

[Chaudhuri and Rao. "Learning Mixtures of Product Distributions Using Correlations and Independence." Annual Conference on Computational Learning Theory, 2008.](https://mlanthology.org/colt/2008/chaudhuri2008colt-learning/)

BibTeX

@inproceedings{chaudhuri2008colt-learning,
  title     = {{Learning Mixtures of Product Distributions Using Correlations and Independence}},
  author    = {Chaudhuri, Kamalika and Rao, Satish},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2008},
  pages     = {9-20},
  url       = {https://mlanthology.org/colt/2008/chaudhuri2008colt-learning/}
}