Real-Time Full-Text Clustering of Networked Documents

Abstract

With the recent explosion of available on-line information, query-based search engines (e.g., AltaVista) and manually constructed topical hierarchies (e.g., Yahoo!) have proven to be valuable. However, these tools alone are becoming inadequate as query results grow unwieldy and manual classification in topic hierarchies creates an immense information bottleneck. We address these problems with a system for topical information space navigation that combines query-based and taxonomic systems by employing machine learning to create dynamic document categorizations based on the full text of articles that are germane to a user's query. Our system, named SONIA (Service for Organizing Networked Information Autonomously), has been implemented as part of the Stanford Digital Libraries Testbed [Gro95]. SONIA takes as input a list of document handles (generally URLs for Web documents, although other distributed data sources, such as DIALOG, are supported) and employs a document retriever (i.e., a Web crawler) capable of robust, real-time retrieval of the full text of up to 250 documents in parallel. Upon retrieving documents, SONIA parses the text into alphanumeric terms (i.e., words) and uses this term set to transform textual documents into a vector-based representation. The dimensions of each vector represent terms encountered in the text of the set of all retrieved documents, and each feature value is simply the (normalized) count of that term in the given document. Since the number of distinct terms in text is very large (10 for even small collections), feature selection becomes necessary. SONIA uses multistage feature selection, drawing on both natural language phenomena and statistical techniques, to reduce the feature set by an order of magnitude or more. Initial feature selection involves stopword (non-meaningful term) removal.
Next, SONIA employs a feature selection method based on a Zipf's Law analysis of word occurrence over the collection of document vectors. Finally, a Term Frequency-Inverse Document Frequency (TFIDF) metric [SB87] is used to eliminate features (terms) that appear too frequently or too infrequently to have much distinguishing power.
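
The vectorization and feature selection steps described above can be illustrated with a minimal Python sketch. It is not SONIA's implementation: the stopword list, the function names, and the `min_df` / `max_df_ratio` thresholds are all hypothetical, and simple document-frequency pruning is used as a simplified stand-in for the Zipf's Law and TFIDF stages, whose exact criteria the abstract does not give. The normalization (dividing counts by the document's total over the retained terms) is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; the abstract does not say which list SONIA uses.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Return the sorted vocabulary left after stopword removal and
    document-frequency pruning (a stand-in for the Zipf/TFIDF stages)."""
    df = Counter()
    for text in docs:
        # Count each term once per document, ignoring stopwords.
        df.update(set(tokenize(text)) - STOPWORDS)
    max_df = max_df_ratio * len(docs)
    # Drop terms that occur in too few or too many documents.
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)


def vectorize(text, vocab):
    """Map a document to normalized term counts over the vocabulary."""
    counts = Counter(tokenize(text))
    total = sum(counts[t] for t in vocab)
    # Assumed normalization: each count divided by the document's total
    # over retained terms; all zeros if no retained term occurs.
    return [counts[t] / total if total else 0.0 for t in vocab]


docs = [
    "The cat chased the mouse",
    "A cat and a dog",
    "The dog chased a ball",
    "Cat videos online",
]
vocab = select_features(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

In this toy collection, terms that appear in only one document (such as "mouse") are pruned, and lowering `max_df_ratio` would also prune terms that appear in most documents. The resulting vectors are what a clustering step would then consume.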

Cite

Text

Sahami et al. "Real-Time Full-Text Clustering of Networked Documents." AAAI Conference on Artificial Intelligence, 1997.

Markdown

[Sahami et al. "Real-Time Full-Text Clustering of Networked Documents." AAAI Conference on Artificial Intelligence, 1997.](https://mlanthology.org/aaai/1997/sahami1997aaai-real/)

BibTeX

@inproceedings{sahami1997aaai-real,
  title     = {{Real-Time Full-Text Clustering of Networked Documents}},
  author    = {Sahami, Mehran and Yusufali, Salim and Baldonado, Michelle Q. Wang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {1997},
  pages     = {845},
  url       = {https://mlanthology.org/aaai/1997/sahami1997aaai-real/}
}