Template-Independent News Extraction Based on Visual Consistency
Abstract
Wrappers are a traditional method for extracting useful information from Web pages. Most previous work relies on the similarity between HTML tag trees and induces template-dependent wrappers. When hundreds of information sources must be extracted in a specific domain such as news, generating and maintaining these wrappers is costly. In this paper, we propose a novel template-independent news extraction approach that identifies news articles based on visual consistency. We first represent a page as a visual block tree. Then, by extracting a series of visual features, we derive a composite visual feature set that is stable in the news domain. Finally, we use a machine learning approach to generate a template-independent wrapper. Experimental results indicate that our approach is effective in extracting news across websites, even from unseen websites. Performance reaches around 95% in terms of F1-value.
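As a rough illustration of the first two steps in the abstract, the sketch below models a page as a tree of rendered rectangles and computes a few size-normalized features per leaf block. The `VisualBlock` structure, the feature names, and the helper functions are all hypothetical and chosen for illustration; the abstract does not list the paper's actual feature set, and the learning step is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class VisualBlock:
    """A rendered rectangle on the page (hypothetical structure, not the paper's)."""
    x: float
    y: float
    width: float
    height: float
    font_size: float = 12.0
    text: str = ""
    children: List["VisualBlock"] = field(default_factory=list)


def iter_blocks(block: VisualBlock) -> Iterator[VisualBlock]:
    """Depth-first traversal of the block tree, root first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def visual_features(block: VisualBlock, page: VisualBlock) -> Dict[str, float]:
    """Illustrative features, normalized by page size so they do not
    depend on the absolute dimensions of any one site's template."""
    block_center = block.x + block.width / 2
    page_center = page.x + page.width / 2
    return {
        "rel_width": block.width / page.width,
        "rel_area": (block.width * block.height) / (page.width * page.height),
        "center_offset": abs(block_center - page_center) / page.width,
        "text_length": float(len(block.text)),
        "font_size": block.font_size,
    }


def leaf_feature_rows(page: VisualBlock) -> List[Dict[str, float]]:
    """One feature dict per leaf block; rows like these are what a
    learner could be trained on (the training step is not shown)."""
    return [
        visual_features(block, page)
        for block in iter_blocks(page)
        if not block.children
    ]


if __name__ == "__main__":
    # Toy page: a narrow navigation column and a wide, centered body block.
    nav = VisualBlock(x=0, y=0, width=200, height=2000, text="Home | World")
    body = VisualBlock(x=200, y=100, width=600, height=1500, text="a" * 500)
    page = VisualBlock(x=0, y=0, width=1000, height=2000, children=[nav, body])
    for row in leaf_feature_rows(page):
        print(row)
```

Because every geometric feature is a ratio against the page rectangle, two sites with different markup but similar rendered layouts yield similar rows, which is the intuition behind a template-independent wrapper.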
Cite
Text
Zheng et al. "Template-Independent News Extraction Based on Visual Consistency." AAAI Conference on Artificial Intelligence, 2007.
Markdown
[Zheng et al. "Template-Independent News Extraction Based on Visual Consistency." AAAI Conference on Artificial Intelligence, 2007.](https://mlanthology.org/aaai/2007/zheng2007aaai-template/)
BibTeX
@inproceedings{zheng2007aaai-template,
title = {{Template-Independent News Extraction Based on Visual Consistency}},
author = {Zheng, Shuyi and Song, Ruihua and Wen, Ji-Rong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2007},
pages = {1507--1511},
url = {https://mlanthology.org/aaai/2007/zheng2007aaai-template/}
}