Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization
Abstract
The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function, known as the Fisher kernel, is derived. Our approach can be applied to unsupervised and supervised learning problems alike. This in particular covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
Cite
Text
Hofmann. "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization." Neural Information Processing Systems, 1999.
Markdown
[Hofmann. "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization." Neural Information Processing Systems, 1999.](https://mlanthology.org/neurips/1999/hofmann1999neurips-learning/)
BibTeX
@inproceedings{hofmann1999neurips-learning,
title = {{Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization}},
author = {Hofmann, Thomas},
booktitle = {Neural Information Processing Systems},
year = {1999},
pages = {914-920},
url = {https://mlanthology.org/neurips/1999/hofmann1999neurips-learning/}
}