
Classification and regression trees

Overview

Wei-Yin Loh
Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.



WIREs Data Mining and Knowledge Discovery, 2011, 1, 14-23. © 2011 John Wiley & Sons, Inc.

Classification trees

In a classification problem, we have a training sample of n observations on a class variable Y that takes values 1, 2, ..., k, and p predictor variables, X1, ..., Xp. Our goal is to find a model for predicting the values of Y from new X values. In theory, the solution is simply a partition of the X space into k disjoint sets, A1, A2, ..., Ak, such that the predicted value of Y is j if X belongs to Aj, for j = 1, 2, ..., k. If the X variables take ordered values, two classical solutions are linear discriminant analysis [1] and nearest-neighbor classification. These methods yield sets Aj with piecewise linear and nonlinear boundaries, respectively, which are not easy to interpret if p is large.

Classification tree methods yield rectangular sets Aj by recursively partitioning the data set one X variable at a time. This makes the sets easier to interpret. Figure 1 gives an example with three classes and two X variables: the left panel plots the data points and partitions, and the right panel shows the corresponding decision tree structure. A key advantage of the tree structure is its applicability to any number of variables, whereas the plot on its left is limited to at most two.

FIGURE 1 | Partitions (left) and decision tree structure (right) for a classification tree model with three classes labeled 1, 2, and 3. At each intermediate node, a case goes to the left child node if and only if the condition is satisfied. The predicted class is given beneath each leaf node.

The first published classification tree algorithm is THAID [4]. Employing a measure of node impurity based on the distribution of the observed Y values in the node, THAID splits a node by exhaustively searching over all X and S for the split {X ∈ S} that minimizes the total impurity of its two child nodes. If X takes ordered values, the set S is an interval of the form (−∞, c]; otherwise, S is a subset of the values taken by X. The process is applied recursively to the data in each child node, and splitting stops if the relative decrease in impurity is below a prespecified threshold. Algorithm 1 gives the pseudocode for the basic steps.

Algorithm 1. Pseudocode for tree construction by exhaustive search

1. Start at the root node.
2. For each X, find the set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X* ∈ S*} that gives the minimum over all X and S.
3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
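As a rough illustration of the exhaustive-search step in Algorithm 1 (it mirrors the idea only, not THAID or any other published implementation), the R sketch below scores every threshold split of one ordered predictor by the size-weighted Gini impurity of the two child nodes. The function names gini and best_ordered_split are ours, and the built-in iris data is used purely for convenience.

```r
## Sketch of Algorithm 1's search over an ordered predictor X:
## evaluate every split X <= c and keep the one with the lowest
## total (size-weighted) impurity of the two child nodes.
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

best_ordered_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]        # m distinct values give m - 1 candidate splits
  total_impurity <- sapply(cuts, function(c) {
    left  <- y[x <= c]
    right <- y[x >  c]
    length(left) * gini(left) + length(right) * gini(right)
  })
  list(cutpoint = cuts[which.min(total_impurity)],
       impurity = min(total_impurity))
}

## Example: best single split of Sepal.Length for predicting iris species.
with(iris, best_ordered_split(Sepal.Length, Species))
```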

C4.5 and CART [6] are two later classification tree algorithms that follow this approach. C4.5 uses entropy for its impurity function, whereas CART uses a generalization of the binomial variance called the Gini index. Unlike THAID, however, they first grow an overly large tree and then prune it to a smaller size to minimize an estimate of the misclassification error. CART employs 10-fold (by default) cross-validation, whereas C4.5 uses a heuristic formula to estimate error rates. CART is implemented in the R system [7] as RPART [8], which we use in the examples below.
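A minimal sketch of the grow-then-prune strategy using RPART (the implementation named above); the iris data and the particular cp values are our illustrative choices, not the article's examples.

```r
library(rpart)

## Grow a deliberately large tree (small cp), then prune it back to the
## subtree whose 10-fold cross-validated error estimate is smallest.
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))

printcp(fit)                                            # CV error for each candidate subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

predict(pruned, newdata = head(iris), type = "class")   # use the pruned tree for prediction
```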

Despite its simplicity and elegance, the exhaustive search approach has an undesirable property. Note that an ordered variable with m distinct values has m − 1 splits of the form X ≤ c, whereas an unordered variable with m distinct values has 2^(m−1) − 1 splits of the form X ∈ S. Therefore, if everything else is equal, variables that have more distinct values have a greater chance to be selected. This selection bias affects the integrity of inferences drawn from the tree structure.
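The gap grows quickly with m; a quick R check (plain arithmetic, not taken from any of the packages discussed) makes the point. With m = 20, an unordered variable offers 524,287 candidate splits against 19 for an ordered one.

```r
## Candidate splits for a variable with m distinct values:
## ordered:   m - 1 threshold splits (X <= c)
## unordered: 2^(m - 1) - 1 subset splits (X in S)
m <- c(2, 4, 10, 20)
data.frame(m = m, ordered = m - 1, unordered = 2^(m - 1) - 1)
```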

Building on an idea that originated in the FACT [9] algorithm, CRUISE [10,11], GUIDE [12], and QUEST [13] use a two-step approach based on significance tests to split each node. First, each X is tested for association with Y and the most significant variable is selected. Then, an exhaustive search is performed for the set S. Because every X has the same chance to be selected if each is independent of Y, this approach is effectively free of selection bias. Besides, much computation is saved because the search for S is carried out only on the selected X variable. GUIDE and CRUISE use chi-squared tests, and QUEST uses chi-squared tests for unordered variables and analysis of variance (ANOVA) tests for ordered variables. CTree [14], another unbiased method, uses permutation tests. Pseudocode for the GUIDE algorithm is given in Algorithm 2. The CRUISE, GUIDE, and QUEST trees are pruned the same way as CART trees.

Algorithm 2. Pseudocode for GUIDE classification tree construction

1. Start at the root node.
2. For each ordered variable X, convert it to an unordered variable X′ by grouping its values in the node into a small number of intervals. If X is unordered, set X′ = X.
3. Perform a chi-squared test of independence of each X′ variable versus Y on the data in the node and compute its significance probability.
4. Choose the variable X* associated with the X′ that has the smallest significance probability.
5. Find the split set {X* ∈ S*} that minimizes the sum of Gini indexes and use it to split the node into two child nodes.
6. If a stopping criterion is reached, exit. Otherwise, apply steps 2-5 to each child node.
7. Prune the tree with the CART method.
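The R sketch below mimics the spirit of steps 2-4 of Algorithm 2; it is not the GUIDE program. The helper names (chisq_pvalue, select_variable) and the choice of four intervals for discretization are our own assumptions. Step 5 would then run an exhaustive Gini search on the single selected variable, as in the earlier sketch, which is where the computational saving comes from.

```r
## Unbiased variable selection in the spirit of Algorithm 2 (simplified).
chisq_pvalue <- function(x, y, k = 4) {
  if (is.numeric(x)) x <- cut(x, breaks = k)             # step 2: discretize ordered X
  suppressWarnings(chisq.test(table(x, y))$p.value)      # step 3: test X' versus Y
}

select_variable <- function(data, yname) {
  xnames <- setdiff(names(data), yname)
  pvals  <- sapply(xnames, function(v) chisq_pvalue(data[[v]], data[[yname]]))
  xnames[which.min(pvals)]                               # step 4: most significant X
}

select_variable(iris, "Species")
```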

CHAID [15] employs yet another strategy. If X is an ordered variable, its data values in the node are split into 10 intervals and one child node is assigned to each interval. If X is unordered, one child node is assigned to each value of X. Then, CHAID uses significance tests and Bonferroni corrections to try to iteratively merge pairs of child nodes. This approach has two consequences. First, some nodes may be split into more than two child nodes. Second, owing to the sequential nature of the tests and the inexactness of the corrections, the method is biased toward selecting variables with few distinct values.

CART, CRUISE, and QUEST can allow splits on linear combinations of all the ordered variables, whereas GUIDE can split on combinations of two variables at a time. If there are missing values, CART and CRUISE use alternate splits on other variables when needed, C4.5 sends each observation with a missing value in a split through every branch using a probability weighting scheme, QUEST imputes the missing values locally, and GUIDE treats missing values as belonging to a separate category. All except C4.5 accept user-specified misclassification costs, and all except C4.5 and CHAID accept user-specified class prior probabilities. By default, all algorithms fit a constant model to each node, predicting Y to be the class with the smallest misclassification cost.
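With RPART, for instance, class priors and a misclassification cost matrix are both passed through the parms argument; the numbers below are arbitrary illustrations, not values from the article.

```r
library(rpart)

## Arbitrary example: unequal priors and an asymmetric loss matrix for the
## three iris classes. Row i, column j of `loss` is the cost of predicting
## class j for a case whose true class is i (the diagonal must be zero).
loss <- matrix(c(0, 1, 1,
                 2, 0, 1,
                 2, 1, 0), nrow = 3, byrow = TRUE)

fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(prior = c(0.5, 0.25, 0.25), loss = loss))
fit
```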

CRUISE can optionally fit bivariate linear discriminant models, and GUIDE can fit bivariate kernel density and nearest-neighbor models in the nodes. GUIDE also can produce ensemble models using bagging [16] and random forest [17] techniques. Table 1 summarizes the features of the algorithms.

To see how the algorithms perform in a real application, we apply them to a data set on new cars for the 1993 model year. There are 93 cars and 25 variables. There is no dominant X variable: passngr is chosen by three algorithms (C4.5, GUIDE, and QUEST); enginsz, fuel, length, and minprice by two; and hp, hwympg, luggage, maxprice, rev, type, and width by one each. Variables airbag, citympg, cylin, midprice, and rpm are not selected by any.

When GUIDE does not find a suitable variable to split a node, it looks for a linear split on a pair of variables. One such split, on enginsz and rseat, occurs at the node marked with an asterisk (*) in the GUIDE tree. Restricting the linear split to two variables allows the data and the split to be displayed in a plot, as shown in Figure 3. Clearly, no single split on either variable alone can do as well in separating the two classes there. Figure 4 shows the C4.5, CRUISE, and GUIDE trees when variable manuf is included.
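The 1993 car data appears to correspond to the Cars93 data shipped with R's MASS package (93 cars from the 1993 model year), although the article renames the variables; treating that correspondence as an assumption, the sketch below reproduces the flavor of the comparison with freely available software by fitting an RPART tree and a conditional inference tree (CTree) to the same cases. The response chosen here (Origin) is our own illustrative choice, not necessarily the class variable used in the article.

```r
library(MASS)      # Cars93: 93 cars from the 1993 model year
library(rpart)     # CART-style trees (RPART)
library(partykit)  # ctree(): conditional inference trees

## Keep a handful of predictors and drop rows with missing values so that
## both methods see the same data. Variable names differ from the article's.
cars <- na.omit(Cars93[, c("Origin", "Price", "MPG.city", "MPG.highway",
                           "EngineSize", "Horsepower", "RPM", "Passengers",
                           "Length", "Width", "Luggage.room", "Type", "AirBags")])

rp <- rpart(Origin ~ ., data = cars, method = "class")   # exhaustive-search tree
ct <- ctree(Origin ~ ., data = cars)                      # unbiased, permutation-test tree

## Comparing the printed trees shows how the two approaches can pick
## different split variables on the same data.
printcp(rp)
print(ct)
```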

