A Dynamic Adaptation of AD-Trees for Efficient Machine Learning on Large Data Sets

Abstract

This paper has no novel learning or statistics: it is concerned with making a wide class of preexisting statistics and learning algorithms computationally tractable when faced with data sets with massive numbers of records or attributes. It briefly reviews the static AD-tree structure of Moore and Lee (1998), and offers a new structure with more attractive properties: (1) the new structure scales better with the number of attributes in the data set; (2) it has zero initial build time; (3) it adaptively caches only statistics relevant to the current task; and (4) it can be used incrementally in cases where new data is frequently being appended to the data set. We provide a careful explanation of the data structure, and then empirically evaluate the performance under varying access patterns induced by different learning algorithms such as association rules, decision trees and Bayes net structures. We conclude by discussing the longer term benefits of the new structure: the eventual ability to apply AD-trees to data sets with real-valued attributes. 1. Description of AD-trees 1.1 What is an AD-tree? Table 1 shows a tiny data set with M 3 symbolic (i.e., categorical) attributes (the columns), and R 6 records (the rows). A counting query has the form C(a1 2 a2 1), and is a request to count the number of a3 records matching the query, with asterisks interpreted as “don’t cares”. C(a1 2 a2 a3 1)=3 in our example. Moore and Lee (1998) and Anderson and Moore (1998) introduced a new data structure for representing the cached counting statistics for a categorical data set, called an All-Table 1. Sample data set with three attributes and six records. a2=1 MCV a2=2

Cite

Text

Komarek and Moore. "A Dynamic Adaptation of AD-Trees for Efficient Machine Learning on Large Data Sets." International Conference on Machine Learning, 2000.

Markdown

[Komarek and Moore. "A Dynamic Adaptation of AD-Trees for Efficient Machine Learning on Large Data Sets." International Conference on Machine Learning, 2000.](https://mlanthology.org/icml/2000/komarek2000icml-dynamic/)

BibTeX

@inproceedings{komarek2000icml-dynamic,
  title     = {{A Dynamic Adaptation of AD-Trees for Efficient Machine Learning on Large Data Sets}},
  author    = {Komarek, Paul and Moore, Andrew W.},
  booktitle = {International Conference on Machine Learning},
  year      = {2000},
  pages     = {495-502},
  url       = {https://mlanthology.org/icml/2000/komarek2000icml-dynamic/}
}