Template-Independent News Extraction Based on Visual Consistency

Abstract

Wrappers are a traditional method for extracting useful information from Web pages. Most previous work relies on the similarity between HTML tag trees and induces template-dependent wrappers. When hundreds of information sources must be extracted in a specific domain such as news, generating and maintaining these wrappers is costly. In this paper, we propose a novel template-independent news extraction approach that identifies news articles based on visual consistency. We first represent a page as a visual block tree. Then, by extracting a series of visual features, we derive a composite visual feature set that is stable in the news domain. Finally, we use a machine learning approach to generate a template-independent wrapper. Experimental results indicate that our approach is effective in extracting news across websites, even from unseen websites. The performance is as high as around 95% in terms of F1-value.
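
To make the first two steps of the abstract concrete, here is a minimal sketch of a visual block tree and a per-block feature vector. It is an illustration under stated assumptions, not the authors' implementation: the `VisualBlock` class, the `leaves` and `visual_features` helpers, and the particular features chosen (relative area, relative position, text length, font size) are all hypothetical, and the abstract does not list the features the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class VisualBlock:
    """A rendered rectangle on the page (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float
    text: str = ""
    font_size: float = 12.0
    children: List["VisualBlock"] = field(default_factory=list)


def leaves(block: VisualBlock) -> Iterator[VisualBlock]:
    """Yield the leaf blocks of the tree in depth-first order."""
    if not block.children:
        yield block
    else:
        for child in block.children:
            yield from leaves(child)


def visual_features(block: VisualBlock, page: VisualBlock) -> Dict[str, float]:
    """Describe a block relative to the page, not to its HTML tags.

    The feature names are assumptions chosen for illustration.
    """
    page_area = page.width * page.height
    return {
        "rel_area": (block.width * block.height) / page_area,
        "rel_center_x": (block.x + block.width / 2) / page.width,
        "rel_top": block.y / page.height,
        "text_length": float(len(block.text)),
        "font_size": block.font_size,
    }


# A toy page: a navigation bar on top and a large centered body block.
nav = VisualBlock(x=0, y=0, width=1000, height=100, text="Home | World | Sports")
body = VisualBlock(x=200, y=300, width=600, height=1200, text="Story text. " * 50)
page = VisualBlock(x=0, y=0, width=1000, height=2000, children=[nav, body])

feature_rows = [visual_features(b, page) for b in leaves(page)]
```

Because every feature is measured relative to the rendered page, the same vector can be computed for pages from different sites without reference to their templates. In a full pipeline, rows like `feature_rows` would be labeled and passed to a classifier; the abstract does not name the learning method, so that step is left out here.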

Cite

Text

Zheng et al. "Template-Independent News Extraction Based on Visual Consistency." AAAI Conference on Artificial Intelligence, 2007.

Markdown

[Zheng et al. "Template-Independent News Extraction Based on Visual Consistency." AAAI Conference on Artificial Intelligence, 2007.](https://mlanthology.org/aaai/2007/zheng2007aaai-template/)

BibTeX

@inproceedings{zheng2007aaai-template,
  title     = {{Template-Independent News Extraction Based on Visual Consistency}},
  author    = {Zheng, Shuyi and Song, Ruihua and Wen, Ji-Rong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2007},
  pages     = {1507-1511},
  url       = {https://mlanthology.org/aaai/2007/zheng2007aaai-template/}
}