Comparison of Clustering Metrics and Unsupervised Learning Algorithms on Genome-Wide Gene Expression Level Data

Abstract

With the recent availability of genome-wide DNA sequence information, biologists are left with the overwhelming task of identifying the biological role of every gene in an organism. Technological advances now provide fast and efficient methods to monitor, on a genomic scale, the patterns of gene expression in response to a stimulus, lending key insight into a gene’s function. With this wealth of information comes the need to organize and analyze the data. One natural approach is to group together genes with similar patterns of expression. Several alternatives have been proposed for both the similarity metric and the clustering algorithm (Wen et al. 1998; Eisen et al. 1998). However, each of these studies used a single, specific pairing of metric and clustering algorithm. In our work, we aim to provide a more systematic investigation into the various metric and clustering algorithm alternatives. We also offer two methods to handle missing data. The data sets include a single time course of rat spinal cord development, a single time course of a human cell growth model, and an aggregation of data from the yeast S. cerevisiae under several experimental conditions. The data contain missing datapoints in cases of measurement error or inconclusive signal. We consider two techniques for handling missing datapoints, namely weighting by the number of valid points, and linear interpolation. For similarity metrics, we compare a Euclidean distance metric, a correlation metric, and a mutual information-based metric. The Euclidean metric is commonly used due to its spatially intuitive interpretation of distance and ease of calculation. However, it might fail to recognize negative correlation, so we use sample correlation to capture both positive and negative correlation. Not all the significant relationships between genes are modelled under either metric. In particular, both summarize the contributions along the whole trajectory, assuming that the type of correlation is constant throughout time. Two genes might be correlated positively within a certain range of their values and negatively related in another range. To capture this type of dependence, we consider a third
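
The abstract names two ways of handling missing datapoints and contrasts Euclidean distance with sample correlation, but it gives no formulas. The following is a minimal Python sketch of those ideas under stated assumptions, not the authors' implementation. The function names, the use of `None` to mark a missing point, the rescaling factor `n_total / n_valid` for the weighted distance, and the nearest-value fill at the ends of a profile are all hypothetical choices made for illustration. The mutual information-based metric is left out because the abstract breaks off before describing it.

```python
import math


def interpolate_missing(profile):
    """Fill None entries by linear interpolation between the nearest valid
    neighbours. Gaps at either end copy the nearest valid value (assumption)."""
    valid = [i for i, v in enumerate(profile) if v is not None]
    if not valid:
        raise ValueError("profile has no valid points")
    filled = list(profile)
    for i, v in enumerate(profile):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = profile[right]
        elif right is None:
            filled[i] = profile[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = profile[left] + t * (profile[right] - profile[left])
    return filled


def valid_pairs(x, y):
    """Keep only the time points measured in both profiles."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def weighted_euclidean(x, y):
    """Euclidean distance over shared valid points, rescaled by
    n_total / n_valid so sparse profiles are not favoured (assumed weighting)."""
    pairs = valid_pairs(x, y)
    if not pairs:
        raise ValueError("profiles share no valid points")
    squared = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(squared * len(x) / len(pairs))


def sample_correlation(x, y):
    """Pearson sample correlation over the shared valid points."""
    pairs = valid_pairs(x, y)
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared valid points")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Two made-up expression profiles; gene_a has one missing time point.
gene_a = [1.0, 2.0, None, 4.0]
gene_b = [4.0, 3.0, 2.0, 1.0]

gene_a_filled = interpolate_missing(gene_a)
distance = weighted_euclidean(gene_a, gene_b)
correlation = sample_correlation(gene_a_filled, gene_b)
```

In this toy example the two profiles move in exactly opposite directions. The Euclidean distance between them is large, so a distance-based clustering would treat them as unrelated, while the sample correlation is strongly negative and exposes the relationship. This is the limitation of the Euclidean metric that the abstract points out.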

Cite

Text

Leach et al. "Comparison of Clustering Metrics and Unsupervised Learning Algorithms on Genome-Wide Gene Expression Level Data." AAAI Conference on Artificial Intelligence, 1999.

Markdown

[Leach et al. "Comparison of Clustering Metrics and Unsupervised Learning Algorithms on Genome-Wide Gene Expression Level Data." AAAI Conference on Artificial Intelligence, 1999.](https://mlanthology.org/aaai/1999/leach1999aaai-comparison/)

BibTeX

@inproceedings{leach1999aaai-comparison,
  title     = {{Comparison of Clustering Metrics and Unsupervised Learning Algorithms on Genome-Wide Gene Expression Level Data}},
  author    = {Leach, Sonia M. and Hunter, Lawrence and Landsman, David},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {1999},
  pages     = {966},
  url       = {https://mlanthology.org/aaai/1999/leach1999aaai-comparison/}
}