Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization

Hofmann, Thomas

NeurIPS 1999 pp. 914-920

Abstract

The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function - known as the Fisher kernel - is derived. Our approach can be applied for unsupervised and supervised learning problems alike. This in particular covers interesting cases where both, labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
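
For orientation, the Fisher kernel named in the abstract is usually written in the following general form. This is a sketch of the standard textbook definition, not a formula taken from this page: the symbols $\theta$ (model parameters), $U_d$ (Fisher score of document $d$), and $I(\theta)$ (Fisher information matrix) are notation assumed here, and the paper's own derivation for the latent class model may use a different parameterization or approximation.

```latex
\[
U_d = \nabla_{\theta} \log P(d \mid \theta), \qquad
I(\theta) = \mathbb{E}_{d}\!\left[ U_d \, U_d^{\top} \right], \qquad
K(d, d') = U_d^{\top} \, I(\theta)^{-1} \, U_{d'}
\]
```

Read this way, two documents are similar when changing the model parameters affects their log-likelihoods in similar directions, with the Fisher information supplying the metric.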

Cite

Text

Hofmann. "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization." Neural Information Processing Systems, 1999.

Markdown

[Hofmann. "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization." Neural Information Processing Systems, 1999.](https://mlanthology.org/neurips/1999/hofmann1999neurips-learning/)

BibTeX

@inproceedings{hofmann1999neurips-learning,
  title     = {{Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization}},
  author    = {Hofmann, Thomas},
  booktitle = {Neural Information Processing Systems},
  year      = {1999},
  pages     = {914-920},
  url       = {https://mlanthology.org/neurips/1999/hofmann1999neurips-learning/}
}