MY AMERICAN SCIENTIST
LOG IN! REGISTER!
SEARCH
 
RSS
Logo

FEATURE ARTICLE

Knowledge Discovery and Data Mining

Computers taught to discern patterns, detect anomalies and apply decision algorithms can help secure computer systems and find volcanoes on Venus

Carla Brodley, Terran Lane, Timothy Stough

Decision Trees

For years, rotogravure printing presses have been plagued by a sporadic problem known as "banding." Rotogravure banding takes place when grooves become engraved onto a cylinder surface, and it shows up as a line of ink across pages of printed matter. The printing company RR Donnelley and Sons applied a decision-tree algorithm to discover the cause. Robert Evans, a technology specialist at RR Donnelley, collected machine data from printer operations, both under normal conditions and when banding occurred. The database from which he and Doug Fisher of Vanderbilt University built the decision tree eventually consisted of measurements on 30 different attributes during approximately 500 press runs—250 when banding did not occur and 250 when it did.

Figure 3. Ink streaks mar a printed pageClick to Enlarge Image

Figure 3 shows a subtree of the decision tree that Evans and Fisher constructed. Out of the 30 variables measured, they found that the most predictive was the chrome solution ratio—the proportion of chrome in the plating solution, used to extend the life of the printing plates. When it was "very low" (actual numbers are not given, as they are proprietary to Donnelley), banding always occurred. But banding was also seen sometimes when the chrome solution ratio was "low" or "high," so the algorithm added further branches to the decision tree to identify other risk factors in these cases. Because the decision tree produced rules that could be put into operation, Donnelley was able to reduce the incidence of banding from 538 occurrences in 1989 to 37 in 1997.

As this example shows, a decision tree is essentially a branched list of questions in which the answer to one question determines the next question asked. Each question is called a "node" on the tree, and each answer leads to a separate "branch." At the end of each branch lies a "leaf node," where the decision tree assigns a classification to the data. For example, at the leaf node "very low chrome solution ratio," the model predicts a high risk of banding; at the leaf node "high chrome ratio/low ink temperature/high viscosity," it predicts a low risk. The more clear-cut the classification at each leaf node, the better; but, as this example shows, the data may not allow for a perfect split.

The decision subtree in Figure 3 is fairly simple, with only four test questions and ten leaf nodes. In general, a tree will be larger if there are more categories to separate the data into, or if more complex decision procedures are required. Evans and Fisher had only two categories (banding/no banding). In the land-cover problem, which had 12 categories, our decision tree had 120 questions and 121 leaf nodes. Of course, only a small fraction of these questions would be required to classify any particular pixel.

How, then, does a computer find the right decision tree? The answer is a combination of brute-force search and information theory.

As we have seen, a decision tree is generated from a set of training data, which we can call D. (Initially, D would be the whole training data set; at later steps, it represents only a part of the training data.) Each observation in D has been assigned by an expert to one of several categories, which we denote by C1, C2 and so on. The job of the decision tree is to reproduce this assignment as accurately as possible.

Each node of the decision tree contains a test or splitting rule T. In the simplest case, this is a yes-no question. It splits the data into two mutually exclusive subsets: the observations for which the answer to the question is "yes" and the observations for which the answer is "no." The best test question would be one for which all of the "yes" (or all of the "no") observations fell into one category Ci. (For example, in the printing-press data, when the answer to "Is the chrome solution ratio very low?" was "yes," banding was always observed.) More often, no single test will be able to make such a clear-cut classification at the top of the decision tree. In that case, the decision-tree algorithm looks for the best available test—the one that makes the most progress toward the correct classification. This part is typically done by brute force, simply by considering every possible test. But a crucial question arises: How do we measure "progress" toward our goal? This measurement is the most important element of a decision-tree algorithm.

Figure 4. Entropy functionClick to Enlarge Image

To measure how good a test is, we can recall the game of Twenty Questions, in which one player asks yes-no questions of the other player ("is it bigger than a breadbox?") in order to identify some object in the room. If we consider the set of possible objects to be of fixed size, then we are interested in answering the following questions: Given a set S (whose number of elements is denoted |S|) and the ability to ask yes-no questions about a particular element of the set x, what is the minimum number of questions you need to ask, in the worst case, to determine what x is?

The answer is log2|S|, because the best we can do at each point is split the data in half. Note that we could get a 30/70 split and with luck get into the 30 split; that is, we could immediately identify the object as part of a minority subset, sharply reducing the remaining possibilities. However, on average we'd get into the 30 split only 30 percent of the time, assuming that we are going to try to classify all objects in S.

Next, consider that we have split S into two subsets, S1 and S2. I tell you which subset x is in. How many questions do you still need to ask to determine what x is? The answer now, if x is in subset S1, is log2(|S1|). Moving ahead, and noting that |S1| divided by |S| is the probability that x is a member of S1, the number of questions that you save (on average) by knowing whether x is in S1 or S2 is the entropy of the set S:

entropy(S) =

-

| S1|

-----

| S |

log2

| S1|

(-----)

| S |

-

| S2|

-----

| S |

log2

| S2|

(-----)

| S |

This is an estimate of the information (measured in "bits") contained in knowing whether x is in S1 or in S2 . Note that "bits" is a measure of information, not a measure of binary units. This measure has also been called binits.

When constructing a decision tree, the key issue is to determine which of the available attributes will produce the best partition of the data. We need to know how much information is needed after the test of the attribute. We can use entropy to select the best attribute to test in a decision tree by estimating how entropy would be expected to decrease when we partition the examples according to the chosen attribute. At each step we then choose the test that provides the highest expectation of information gain. We continue until every branch ends at a leaf node or until no gain in information is given by further testing.

The result is often a very large, complex tree that attaches too much significance to the error or noise in the data. During the creation of the tree, such errors show up as exceptions to the rules, which force the model to include extra questions that are either unnecessary or misleading. Therefore, the tree must be pruned back. The result is a tree that may have less than 100 percent accuracy on the training set, but that will make fewer mistakes when it is applied to data outside of the training set.





» Post Comment

 

EMAIL TO A FRIEND :

Of Possible Interest

Computing Science: A Box of Universe

Feature Article: Empirical Software Engineering

Computing Science: An Adventure in the Nth Dimension

Subscribe to American Scientist

Sites of Interest

Duxbury Ventures Website Investments

Social Justice

Find Websites Worth

München Fair Hotels

ABC Fundraising

Promotional Products

Business Cards

Car Hire

Get a Gold Ira at Regal Assets.

Online Shopping