Semi-Supervised Document Clustering with Simultaneous Text Representation and Categorization

Abstract

To derive high-quality information from text, the field of text mining has advanced swiftly from simple document clustering to co-clustering with words and categories. However, document co-clustering without any prior knowledge or background information is a challenging problem. In this paper, we propose a Semi-Supervised Non-negative Matrix Factorization (SS-NMF) framework for document co-clustering. Our method computes new word-document and document-category matrices by incorporating user-provided constraints through simultaneous distance metric learning and modality selection. Using an iterative algorithm, we perform tri-factorization of the new matrices to infer the document, category, and word clusters. Theoretically, we show the convergence and correctness of SS-NMF co-clustering and its advantages over existing approaches. Through extensive experiments on publicly available data sets, we demonstrate the superior performance of SS-NMF for document co-clustering.
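To make the tri-factorization step concrete, here is a minimal sketch of plain non-negative matrix tri-factorization, X ≈ G S Fᵀ, using standard multiplicative updates for the squared Frobenius error. It is an illustration under stated assumptions, not the SS-NMF algorithm from the paper: it factorizes a single matrix, omits the user-provided constraints, distance metric learning, and modality selection, and the function names (`tri_factorize`, `frob_error`), the toy matrix, and the row/column orientation (rows as documents, columns as words) are all hypothetical. It uses plain lists to stay dependency-free; an array library would be the usual choice in practice.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def mult_update(m, numer, denom, eps=1e-9):
    """Element-wise m * numer / denom; eps guards against division by zero."""
    return [[v * n / (d + eps) for v, n, d in zip(rv, rn, rd)]
            for rv, rn, rd in zip(m, numer, denom)]


def frob_error(x, g, s, f):
    """Squared Frobenius norm of x - g s f^T."""
    approx = matmul(matmul(g, s), transpose(f))
    return sum((p - q) ** 2 for rp, rq in zip(x, approx) for p, q in zip(rp, rq))


def tri_factorize(x, k, l, iters=200, seed=0):
    """Approximate x (n x m) as g (n x k) @ s (k x l) @ f^T (f is m x l).

    All factors stay non-negative because they start positive and are
    only ever rescaled by non-negative ratios.
    """
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    n, m = len(x), len(x[0])
    g, s, f = rand(n, k), rand(k, l), rand(m, l)
    xt = transpose(x)
    for _ in range(iters):
        # G <- G * (X F S^T) / (G S F^T F S^T)
        fst = matmul(f, transpose(s))
        g = mult_update(g, matmul(x, fst),
                        matmul(g, matmul(transpose(fst), fst)))
        # F <- F * (X^T G S) / (F S^T G^T G S)
        gs = matmul(g, s)
        f = mult_update(f, matmul(xt, gs),
                        matmul(f, matmul(transpose(gs), gs)))
        # S <- S * (G^T X F) / (G^T G S F^T F)
        gt = transpose(g)
        s = mult_update(s, matmul(matmul(gt, x), f),
                        matmul(matmul(matmul(gt, g), s),
                               matmul(transpose(f), f)))
    return g, s, f


# Toy document-by-word count matrix with two visible blocks (made-up data).
x = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
g, s, f = tri_factorize(x, k=2, l=2)
# Assign each document (row) and word (column) to its strongest factor.
doc_clusters = [max(range(2), key=row.__getitem__) for row in g]
word_clusters = [max(range(2), key=row.__getitem__) for row in f]
print(doc_clusters, word_clusters, round(frob_error(x, g, s, f), 3))
```

Reading cluster labels from the largest entry in each row of G and F is one common convention; the semi-supervised setting described in the abstract would first reshape the input matrices using the constraints before a factorization of this kind is applied.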

Cite

Text

Chen et al. "Semi-Supervised Document Clustering with Simultaneous Text Representation and Categorization." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009. doi:10.1007/978-3-642-04180-8_31

Markdown

[Chen et al. "Semi-Supervised Document Clustering with Simultaneous Text Representation and Categorization." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009.](https://mlanthology.org/ecmlpkdd/2009/chen2009ecmlpkdd-semisupervised/) doi:10.1007/978-3-642-04180-8_31

BibTeX

@inproceedings{chen2009ecmlpkdd-semisupervised,
  title     = {{Semi-Supervised Document Clustering with Simultaneous Text Representation and Categorization}},
  author    = {Chen, Yanhua and Wang, Lijun and Dong, Ming},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2009},
  pages     = {211--226},
  doi       = {10.1007/978-3-642-04180-8_31},
  url       = {https://mlanthology.org/ecmlpkdd/2009/chen2009ecmlpkdd-semisupervised/}
}