Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment

Abstract

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records, and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and WWW-05.
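To make the tree matching in the title concrete, the following is a minimal sketch of one common way to score how similar two tag trees are: a dynamic-programming match over ordered children, with a match counted only when the two roots carry the same tag. It is an illustration under stated assumptions, not the paper's implementation: the `Node` class, the function names, the toy records, and the normalization by average tree size are all hypothetical choices made here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node (hypothetical structure, for illustration only)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_match(a: Node, b: Node) -> int:
    """Count the node pairs in a maximum order-preserving top-down matching."""
    if a.tag != b.tag:
        return 0  # different roots: nothing below them may match
    m, n = len(a.children), len(b.children)
    # table[i][j] = best match between the first i children of a
    # and the first j children of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Number of nodes in the tree rooted at node."""
    return 1 + sum(tree_size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Match count divided by the average tree size (an assumed normalization)."""
    return tree_match(a, b) / ((tree_size(a) + tree_size(b)) / 2)


# Two toy data records; the second lacks the last field of the first.
rec1 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")]),
                   Node("td", [Node("b")])])
rec2 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")])])

matched = tree_match(rec1, rec2)  # 5 matched node pairs
score = similarity(rec1, rec2)
```

In a sketch like this, sibling subtrees whose score exceeds a chosen threshold would be treated as candidate data records with similar patterns, which corresponds to step (1) of the method; the alignment in step (2) would then operate on the matched trees.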

Cite

Text

Zhai and Liu. "Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment." AAAI Conference on Artificial Intelligence, 2006.

Markdown

[Zhai and Liu. "Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment." AAAI Conference on Artificial Intelligence, 2006.](https://mlanthology.org/aaai/2006/zhai2006aaai-automatic/)

BibTeX

@inproceedings{zhai2006aaai-automatic,
  title     = {{Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment}},
  author    = {Zhai, Yanhong and Liu, Bing},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2006},
  pages     = {1687-1690},
  url       = {https://mlanthology.org/aaai/2006/zhai2006aaai-automatic/}
}