Tag-Weighted Topic Model for Mining Semi-Structured Documents
Abstract
Over the last decade, latent Dirichlet allocation (LDA) has successfully discovered the statistical distribution of topics over unstructured text corpora. Meanwhile, as the Internet has evolved, more and more document data come with rich human-provided tag information; such data are called semi-structured data. Semi-structured data contain both unstructured data (e.g., plain text) and metadata, such as papers with authors and web pages with tags. In general, different tags in a document play different roles with their own weights. Modeling such semi-structured documents is non-trivial. In this paper, we propose a novel topic model for tagged documents, called the Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages the tags in each document to infer the topic components of the documents. This allows it not only to learn document-topic distributions, but also to infer tag-topic distributions for text mining (e.g., classification, clustering, and recommendation). Moreover, TWTM automatically infers the probabilistic weights of tags for each document. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. Experimental results show that TWTM outperforms the baseline algorithms on three corpora in document modeling and text classification.
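The relationship between tag weights, tag-topic distributions, and document-topic distributions can be pictured with a small sketch. It assumes, for illustration only, that a document's topic distribution is a convex combination of the topic distributions of its tags, weighted by per-document tag weights. The abstract does not spell out the generative process, so this mixing rule, the function name `mix_topics`, the tag names, and all numbers are hypothetical, and no inference (variational EM) is performed here.

```python
def mix_topics(tag_weights, tag_topic):
    """Combine per-tag topic distributions into one document-topic distribution.

    tag_weights: dict mapping tag -> non-negative weight for this document
    tag_topic:   dict mapping tag -> list of topic probabilities (sums to 1)
    Assumption (not taken from the abstract): the document-topic distribution
    is the weight-normalized mixture of its tags' topic distributions.
    """
    total = sum(tag_weights.values())
    if total <= 0:
        raise ValueError("tag weights must sum to a positive value")
    num_topics = len(next(iter(tag_topic.values())))
    doc_topic = [0.0] * num_topics
    for tag, weight in tag_weights.items():
        share = weight / total  # normalize so the weights sum to 1
        for k, prob in enumerate(tag_topic[tag]):
            doc_topic[k] += share * prob
    return doc_topic


# Hypothetical tag-topic distributions over three topics.
tag_topic = {
    "author:li": [0.7, 0.2, 0.1],
    "venue:ijcai": [0.1, 0.6, 0.3],
}
# Hypothetical per-document tag weights.
tag_weights = {"author:li": 0.25, "venue:ijcai": 0.75}

doc_topic = mix_topics(tag_weights, tag_topic)
print(doc_topic)
```

Under this assumption, a tag with a larger weight pulls the document's topic mixture toward its own topic distribution, which is one way to read the claim that different tags play different roles with their own weights.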
Cite
Text
Li et al. "Tag-Weighted Topic Model for Mining Semi-Structured Documents." International Joint Conference on Artificial Intelligence, 2013.
Markdown
[Li et al. "Tag-Weighted Topic Model for Mining Semi-Structured Documents." International Joint Conference on Artificial Intelligence, 2013.](https://mlanthology.org/ijcai/2013/li2013ijcai-tag/)
BibTeX
@inproceedings{li2013ijcai-tag,
title = {{Tag-Weighted Topic Model for Mining Semi-Structured Documents}},
author = {Li, Shuangyin and Li, Jiefei and Pan, Rong},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2013},
pages = {2855-2861},
url = {https://mlanthology.org/ijcai/2013/li2013ijcai-tag/}
}