Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment
Abstract
This paper addresses the problem of structured data extraction from Web pages. The objective is to automatically segment the data records in a page, extract data items/fields from these records, and store the extracted data in a database. We first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) that performs Web data extraction automatically. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records, and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and WWW-05.
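The abstract does not spell out how tree matching is computed, so the following is only a minimal sketch of one common choice: the classic Simple Tree Matching recurrence, which counts the maximum number of matching node pairs between two ordered, labeled trees using dynamic programming over their child lists. The `Node` class and the function name `simple_tree_matching` are hypothetical names introduced for illustration, and using this particular algorithm here is an assumption, not a description of the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: an HTML tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of a maximum matching between two ordered trees.

    Roots with different tags cannot match, so the score is 0. Otherwise
    the children are aligned in order (as in longest common subsequence),
    and the root pair itself contributes 1.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best score using the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1


# Two made-up data records: table rows that differ by one trailing cell.
record_a = Node("tr", [Node("td"), Node("td", [Node("a")]), Node("td")])
record_b = Node("tr", [Node("td"), Node("td", [Node("a")])])
score = simple_tree_matching(record_a, record_b)
```

A high score relative to the sizes of the two trees suggests that two subtrees follow the same pattern, which is the kind of signal step (1) needs when grouping similar data records.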
Cite
Text
Zhai and Liu. "Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment." AAAI Conference on Artificial Intelligence, 2006.
Markdown
[Zhai and Liu. "Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment." AAAI Conference on Artificial Intelligence, 2006.](https://mlanthology.org/aaai/2006/zhai2006aaai-automatic/)
BibTeX
@inproceedings{zhai2006aaai-automatic,
title = {{Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment}},
author = {Zhai, Yanhong and Liu, Bing},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2006},
pages = {1687-1690},
url = {https://mlanthology.org/aaai/2006/zhai2006aaai-automatic/}
}