Identifying the Original Contribution of a Document via Language Modeling

Shaparenko, Benyah; Joachims, Thorsten

doi:10.1007/978-3-642-04174-7_23

ECML-PKDD 2009 pp. 350-365

Abstract

One major goal of text mining is to provide automatic methods to help humans grasp the key ideas in ever-increasing text corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and the model is used to identify each document’s most original passages. Unlike heuristic approaches, the statistical model is extensible and open to analysis. We evaluate the approach both on synthetic data and on real data in the domains of research publications and news, showing that the passage impact model outperforms a heuristic baseline method.

Cite

Text

Shaparenko and Joachims. "Identifying the Original Contribution of a Document via Language Modeling." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009. doi:10.1007/978-3-642-04174-7_23

Markdown

[Shaparenko and Joachims. "Identifying the Original Contribution of a Document via Language Modeling." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009.](https://mlanthology.org/ecmlpkdd/2009/shaparenko2009ecmlpkdd-identifying/) doi:10.1007/978-3-642-04174-7_23

BibTeX

@inproceedings{shaparenko2009ecmlpkdd-identifying,
  title     = {{Identifying the Original Contribution of a Document via Language Modeling}},
  author    = {Shaparenko, Benyah and Joachims, Thorsten},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2009},
  pages     = {350--365},
  doi       = {10.1007/978-3-642-04174-7_23},
  url       = {https://mlanthology.org/ecmlpkdd/2009/shaparenko2009ecmlpkdd-identifying/}
}