FEATURE ARTICLE
Knowledge Discovery and Data Mining
Computers taught to discern patterns, detect anomalies and apply decision algorithms can help secure computer systems and find volcanoes on Venus
Carla Brodley, Terran Lane, Timothy Stough
Recipes for Knowledge
Data mining is just one part of the process of knowledge discovery in databases (often abbreviated KDD). KDD is an iterative process with six stages: 1) develop an understanding of the proposed application; 2) create a target data set; 3) remove or correct corrupted data; 4) apply data-reduction algorithms; 5) apply a data-mining algorithm; and 6) interpret the mined patterns. Some steps may be skipped, and the process is not necessarily sequential: often the results of one step cause the practitioner to back up to an earlier one. Although research tends to focus on step 5, the steps before and after data mining are equally important. In particular, it takes an expert in the application field, not a KDD expert, to decide whether the mined patterns are meaningful.
When we began to work on the global land-cover problem, we first determined 12 categories of interest: tundra, wooded grassland and so forth. The satellites that provided our data could not count trees or tell grass apart from crops; they could only measure the reflectivity of the surface at certain wavelengths and at certain times of the year. Thus our data-mining objective was to come up with rules for converting those measurements to an appropriate land-cover category for each pixel in our world map.
After we defined the problem, the next step was the creation of training data. This is a subset of the data that trains the data-mining algorithm to interpret the rest of the data correctly—just as a student learns a new subject by solving practice problems. Sometimes the KDD system itself can identify useful portions of the data for training; other times, a domain expert (human or not) performs this task. In our case, we selected only pixels where three existing vegetation maps agreed. In effect, we used the consensus of the three maps as our domain expert.
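The consensus rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three label lists are invented toy data, and the label "cropland" and the variable names are hypothetical rather than taken from the actual project.

```python
# Toy land-cover labels from three existing vegetation maps, one label per pixel.
# All values here are invented for illustration.
map_a = ["tundra", "wooded grassland", "tundra", "cropland"]
map_b = ["tundra", "wooded grassland", "forest", "cropland"]
map_c = ["tundra", "grassland", "tundra", "cropland"]

# Keep (pixel_index, label) only where all three maps agree.
training_pixels = []
for index, (a, b, c) in enumerate(zip(map_a, map_b, map_c)):
    if a == b == c:
        training_pixels.append((index, a))

print(training_pixels)  # [(0, 'tundra'), (3, 'cropland')]
```

Pixels on which the maps disagree are simply left out of the training set, so the sample is smaller but its labels are more trustworthy.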
Choosing training examples is both important and extremely challenging. The sample must be large enough to support the validity of the discovered knowledge. Otherwise, like a student who has done too few practice problems, the data-mining program is likely to discover "rules" that don't work on other parts of the data. (Because of the low quality of the output, mining small data sets is sometimes called "data dredging.")
After selecting the training data, the next step is to clean the data and select or enhance the relevant features. For example, in a business application a person's income might be relevant to marketing strategies, but his or her Social Security number would not. Data reduction through feature enhancement is particularly important in image databases because of the sheer magnitude of the pixel data. We shall describe this below, in the context of our project to identify volcanoes on Venus.
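To make the cleaning and feature-selection step concrete, here is a minimal Python sketch built on the business example above. The records, the field names and the choice of "income" as the only relevant feature are all assumptions made for illustration.

```python
# Invented customer records; "ssn" stands in for an irrelevant identifier.
records = [
    {"name": "A", "income": 52000, "ssn": "000-00-0001"},
    {"name": "B", "income": None, "ssn": "000-00-0002"},  # corrupted: income missing
    {"name": "C", "income": 61000, "ssn": "000-00-0003"},
]

# Hypothetical list of features judged relevant to the marketing question.
RELEVANT_FEATURES = ("income",)

# Cleaning: drop records with a missing relevant value.
# Feature selection: keep only the relevant fields of the surviving records.
cleaned = [
    {key: record[key] for key in RELEVANT_FEATURES}
    for record in records
    if all(record.get(key) is not None for key in RELEVANT_FEATURES)
]

print(cleaned)  # [{'income': 52000}, {'income': 61000}]
```

Here the corrupted record is removed; correcting it instead would require some way to recover the missing value.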

Next comes the choice of a data-mining algorithm. There are myriad methods available, and the choice depends strongly on the kind of data and the intended use for the mined knowledge. Is the model intended to be predictive or explanatory? Should the patterns discovered be understandable by people, or is reliability the most important consideration? Neural networks, for example, have been very popular in the machine learning community—they have been used to create machines that can recognize barcodes or learn to steer a car. But they are often less suitable for KDD, because they do not explain to a human user how they arrive at their predictions. If the goal of knowledge discovery is human knowledge, a "black box" oracle cannot be a desirable solution.
For the land-cover data we chose a decision-tree algorithm, because it produces a classification structure that is particularly easy for a person to understand. Decision trees have had many business and industrial applications; for example, in the financial industry, they are used to decide whether a person is a good or poor credit risk. We shall explain below what a decision tree is and how a decision-tree algorithm finds it.
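As a rough preview of why such a structure is easy to read, the sketch below applies a small hand-built tree to one pixel. It is not the tree our algorithm produced: the feature names, the thresholds and the label "evergreen forest" are invented, and no learning takes place here.

```python
# A hand-built toy decision tree. Each internal node tests one measurement
# against a threshold; each leaf holds a land-cover label.
tree = {
    "feature": "summer_reflectivity",
    "threshold": 0.30,
    "below": {"label": "tundra"},
    "above": {
        "feature": "winter_reflectivity",
        "threshold": 0.20,
        "below": {"label": "wooded grassland"},
        "above": {"label": "evergreen forest"},
    },
}

def classify(node, pixel):
    """Follow the tests from the root down until a leaf label is reached."""
    while "label" not in node:
        branch = "below" if pixel[node["feature"]] < node["threshold"] else "above"
        node = node[branch]
    return node["label"]

pixel = {"summer_reflectivity": 0.45, "winter_reflectivity": 0.10}
print(classify(tree, pixel))  # wooded grassland
```

Each prediction can be read back as a short chain of threshold tests, which is the kind of explanation a "black box" model does not offer.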
Knowledge discovery does not end with the data-mining algorithm. The remaining step is to interpret the meaning of the mined patterns and verify that they are accurate. The interpretation can often be assisted by visualization methods, another large field of research. (See Brown et al. 1995 for an introduction.) Accuracy can be assessed by testing the results on a set of validation data. If the decision tree does not perform well enough, or if nothing of interest has been found, the investigator must go back to a previous step. In the land-cover example, our decision tree predicted the correct type of vegetation over 90 percent of the time, making the results reliable enough to use in climatology models.
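The validation check can be sketched as follows. The labels are invented toy data, and the 0.9 cutoff is a hypothetical requirement chosen to echo the figure above, not a rule from the project.

```python
def accuracy(predicted, actual):
    """Fraction of validation examples whose predicted label matches the known label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Ten held-out validation pixels with known labels (toy data).
actual = ["tundra"] * 5 + ["wooded grassland"] * 5
# Toy predictions: nine of the ten agree with the known labels.
predicted = ["tundra"] * 5 + ["wooded grassland"] * 4 + ["tundra"]

REQUIRED_ACCURACY = 0.9  # hypothetical acceptance threshold

score = accuracy(predicted, actual)
needs_rework = score < REQUIRED_ACCURACY  # True would send us back to an earlier step

print(f"{score:.0%}", needs_rework)  # 90% False
```

The validation pixels must be kept apart from the training pixels; otherwise the score measures memory rather than the ability to generalize.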