Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model
Abstract
Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, the task of table extraction from web pages is difficult, because of HTML’s design purpose to convey visual instead of semantic information. In this paper, we propose a robust technique for table extraction from arbitrary web pages. This technique relies upon the positional information of visualized DOM element nodes in a browser and, hereby, separates the intricacies of code implementation from the actual intended visual appearance. The novel aspect of the proposed web table extraction technique is the effective use of spatial reasoning on the CSS2 visual box model, which shows a high level of robustness even without any form of learning (F-measure ≈ 90%). We describe the ideas behind our approach, the tabular pattern recognition algorithm operating on a double topographical grid structure and allowing for effective and robust extraction, and general observations on web tables that should be borne in mind by any automatic web table extraction mechanism.
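To make the idea of reasoning over rendered positions rather than markup more concrete, the following is a minimal toy sketch. It is not the authors' algorithm and does not reproduce the double topographical grid structure mentioned in the abstract. It only assumes that each visible element has already been reduced to a bounding box (in a browser, such coordinates could come from something like `Element.getBoundingClientRect()`), and it checks whether the boxes align into rows and columns. The names `Box`, `cluster`, `boxes_to_grid`, and `looks_tabular`, and the 2-pixel tolerance, are hypothetical choices made for this illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered bounding box of one visible element."""
    left: float
    top: float
    right: float
    bottom: float
    text: str


def cluster(values, tol=2.0):
    """Group coordinates lying within `tol` of their sorted predecessor.

    Returns the smallest coordinate of each group, in ascending order.
    """
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [g[0] for g in groups]


def boxes_to_grid(boxes, tol=2.0):
    """Place each box into a row/column grid based on its top-left corner."""
    rows = cluster([b.top for b in boxes], tol)
    cols = cluster([b.left for b in boxes], tol)
    grid = [[None] * len(cols) for _ in rows]
    for b in boxes:
        # Each coordinate belongs to the group with the largest minimum <= it.
        r = bisect_right(rows, b.top) - 1
        c = bisect_right(cols, b.left) - 1
        grid[r][c] = b.text
    return grid


def looks_tabular(grid):
    """Assumed heuristic: at least 2x2 and every cell is filled."""
    return (
        len(grid) >= 2
        and len(grid[0]) >= 2
        and all(cell is not None for row in grid for cell in row)
    )


# Four boxes with one pixel of rendering jitter, laid out as a 2x2 table.
boxes = [
    Box(10, 10, 60, 30, "Name"),
    Box(70, 11, 120, 30, "Price"),
    Box(10, 40, 60, 60, "Tea"),
    Box(71, 40, 120, 60, "3.50"),
]
grid = boxes_to_grid(boxes)
print(grid)
print(looks_tabular(grid))
```

Because the sketch looks only at coordinates, it would treat a table built from `<div>` elements the same way as one built from `<table>` markup, which is the separation of implementation from appearance that the abstract describes. A real extractor would also need to handle spanning cells, nested layouts, and whitespace, none of which this toy version attempts.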
Cite
Text
Gatterbauer and Bohunsky. "Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model." AAAI Conference on Artificial Intelligence, 2006.
Markdown
[Gatterbauer and Bohunsky. "Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model." AAAI Conference on Artificial Intelligence, 2006.](https://mlanthology.org/aaai/2006/gatterbauer2006aaai-table/)
BibTeX
@inproceedings{gatterbauer2006aaai-table,
title = {{Table Extraction Using Spatial Reasoning on the CSS2 Visual Box Model}},
author = {Gatterbauer, Wolfgang and Bohunsky, Paul},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2006},
pages = {1313-1318},
url = {https://mlanthology.org/aaai/2006/gatterbauer2006aaai-table/}
}