There are many alternatives for representing classifiers. The decision tree is probably the most widely used approach for this purpose. Originally, it was studied in the fields of decision theory and statistics; however, it has also proved effective in other disciplines such as data mining, machine learning, and pattern recognition. Decision trees are also implemented in many real-world applications. Given the long history of and the intense interest in this approach, it is not surprising that several surveys on decision trees are available in the literature. Nevertheless, this survey offers a thorough but concise description of issues related specifically to top-down construction of decision trees, which is considered the most popular construction approach. This paper aims to organize all significant methods developed into a coherent and unified reference.
2. DECISION TREES
A decision tree (or tree diagram) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. Another use of decision trees is as a descriptive means for calculating conditional probabilities. In data mining and machine learning, a decision tree is a predictive model; that is, a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree (discrete outcome) or regression tree (continuous outcome). In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning, or (colloquially) decision trees.
3. DECISION TREE REPRESENTATION
The decision tree induction algorithm has been in broad use for many years. It approximates discrete-valued functions and can yield many useful rules, which makes it one of the most important methods for classification. The algorithm's terminology follows the "tree" metaphor: it has a root, which is the first split point of the data attributes when building the tree, and it has leaves, so that every path from root to leaf forms a rule that is easily understood. Since the decision tree is built from the given data, the values and character of that data matter: the amount of data affects the result of the tree-building procedure, and the types of attribute values affect the tree model. Decision trees need two kinds of data: training and testing.
Training data, which are usually the larger part of the data, are used for constructing trees; the more training data collected, the higher the accuracy of the results. The other group of data, the testing data, is used to obtain the accuracy rate and misclassification rate of the decision tree. Many decision-tree algorithms have been developed. One of the most famous is ID3 (Quinlan 1986, 1983), whose choice of split attribute is based on information entropy. C4.5 is an extension of ID3 (Prather et al. 1997); it improves computing efficiency, deals with continuous values, handles attributes with missing values, avoids overfitting, and performs other functions.
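As a brief illustration of this training/testing division, the following Python sketch splits a dataset into training and testing portions, builds a tree on the training part, and reports the accuracy and misclassification rates on the testing part. It uses scikit-learn's CART-style DecisionTreeClassifier and the Iris dataset purely as stand-ins; neither is part of the survey itself.

```python
# Sketch: measuring accuracy and misclassification rate with a train/test split.
# scikit-learn's DecisionTreeClassifier (a CART-style learner) and the Iris data
# are used only as stand-ins; any of the algorithms discussed below could take
# their place.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# The larger portion of the data is used for training, the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"accuracy: {accuracy:.3f}, misclassification rate: {1 - accuracy:.3f}")
```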
CART (Classification and Regression Trees) is a data-exploration and prediction algorithm that, like C4.5, is a tree-construction algorithm. Breiman et al. (1984) summarized classification and regression trees. Instead of information entropy, CART introduces measures of node impurity. It has been used on a variety of problems, such as the detection of chlorine from the data contained in a mass spectrum. Although decision trees may not give the best classification accuracy, even people who are not familiar with them find them easy to use and understand. Figure 1 shows a binary decision tree; it gives an impression of a decision. A circle denotes a decision node and a square a terminal node. Each decision node has a condition represented by a function F whose parameter is the split point of the split attribute, and each terminal node has a class label C whose value represents a class. It is apparent that decision trees are easy to interpret as rules, on which further analysis can be done, and that they are an easily interpreted representation of a nonlinear input-output mapping (Jang 1994).
Figure 1: A typical binary decision tree
Many works address the method of choosing the splitting node and the optimization of tree size, but less attention has been given to the weights of the data attributes. In this study, we use a system-reconstruction analysis method to obtain the weight of each attribute, which we use to reform the raw data. After that, we use the decision-tree algorithms mentioned above to build a decision tree, from which we can find the decision-accuracy and misclassification rates.
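To make the structure in Figure 1 concrete, here is a minimal Python sketch of a binary decision tree and of reading its root-to-leaf paths off as rules. The Node class, the attribute names ('age', 'income'), and the thresholds are all hypothetical choices for illustration, not part of the original figure.

```python
# Sketch of the binary tree in Figure 1: decision nodes hold a split attribute and
# split point (the condition F), terminal nodes hold a class label C.
# The attribute names and thresholds below are invented for illustration.
class Node:
    def __init__(self, attribute=None, split_point=None, left=None, right=None, label=None):
        self.attribute = attribute      # attribute tested at a decision node
        self.split_point = split_point  # split point used by the condition F
        self.left = left                # branch taken when the condition holds
        self.right = right              # branch taken otherwise
        self.label = label              # class label C at a terminal node

def to_rules(node, conditions=()):
    """Walk the tree and print one rule per root-to-leaf path."""
    if node.label is not None:
        print("IF " + " AND ".join(conditions) + f" THEN class = {node.label}")
        return
    to_rules(node.left,  conditions + (f"{node.attribute} <= {node.split_point}",))
    to_rules(node.right, conditions + (f"{node.attribute} > {node.split_point}",))

# A toy tree: two decision nodes and three leaves.
tree = Node("age", 30,
            left=Node(label="low risk"),
            right=Node("income", 50000,
                       left=Node(label="high risk"),
                       right=Node(label="low risk")))
to_rules(tree)
```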
4. ID3 ALGORITHM
The ID3 algorithm can be summarized as follows:
1. Take all unused attributes and compute their entropy with respect to the training samples.
2. Choose the attribute for which entropy is minimum (equivalently, for which information gain is maximum).
3. Make a node containing that attribute, split the data on its values, and repeat for each branch.
The background of the algorithm and its basic implementation steps are described below.
According to Gestwicki, the Iterative Dichotomiser 3 algorithm, better known as the ID3 algorithm, was first introduced by J. R. Quinlan in the late 1970s. The algorithm 'learned' from a relatively small training set of data how to organize and process very large data sets. Ballard stated that ID3 is a greedy algorithm that selects the next attribute based on the information gain associated with the attributes. Information gain is measured by entropy, an idea first introduced by Claude Shannon in 1948.
The ID3 algorithm prefers that the generated tree be shorter and that attributes with lower entropies be placed near the top of the tree. These preferences satisfy the idea of Occam's Razor, which states that "one should not increase, beyond what is necessary, the number of entities required to explain anything"; in other words, one should not make more assumptions than the minimum needed. Hild described the basic technique for implementing the ID3 algorithm, and it is shown below.
1. For each uncategorized attribute, calculate its entropy with respect to the categorized attribute, or conclusion. Select the attribute with the lowest entropy.
2. Divide the data into sets according to the selected attribute's values. For example, if the attribute 'Size' was chosen, and the values for 'Size' were 'big', 'medium' and 'small', three sets would be created, divided by these values.
3. Construct a tree with branches that represent the sets. For the above example, three branches would be created: the first branch 'big', the second 'medium' and the third 'small'.
4. Repeat from Step 1 for each branch, removing the already selected attribute and using only the data that belongs to that branch's set.
5. Stop when there are no more attributes to be considered or when all the data in a set have the same conclusion, for example, all data have 'Result' = yes.
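A minimal Python sketch of the procedure just described is given below: it computes the entropy of each remaining attribute with respect to the conclusion, selects the lowest-entropy attribute, splits the data on its values, and recurses. The 'Size'/'Result' column names echo the example in the text; the rest of the toy data is invented for illustration.

```python
# Minimal ID3 sketch: pick the attribute with the lowest conditional entropy
# (highest information gain), split on its values, and recurse.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(rows, attribute, target):
    """Weighted entropy of the conclusion after splitting on the given attribute."""
    total = len(rows)
    result = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # all rows share one conclusion
        return labels[0]
    if not attributes:                     # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = min(attributes, key=lambda a: conditional_entropy(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Toy data echoing the 'Size'/'Result' example in the text.
data = [
    {"Size": "big", "Colour": "red", "Result": "yes"},
    {"Size": "small", "Colour": "red", "Result": "no"},
    {"Size": "medium", "Colour": "blue", "Result": "yes"},
    {"Size": "small", "Colour": "blue", "Result": "yes"},
]
print(id3(data, ["Size", "Colour"], "Result"))
```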
The ID3 algorithm has been used and implemented in many fields. One of the earliest implementations was in a chess game, by the artificial intelligence researcher Ivan Bratko. According to Gestwicki, Bratko supplied the ID3 program with several pages of textbook recommendations for playing the chess endgame of white king and rook versus black king and knight. He built the rules around the idea of 'knight's side lost in at most n moves'. The results show that the ID3 algorithm is efficient in both time and space, since the feature vector of the games and the decision tree are small compared with the training instances.
In a study described by Gestwicki, an experiment was conducted to predict greyhound races. The experiment compared the net profit gained by the ID3 algorithm with that gained by three greyhound-racing experts. The system was trained with 200 training races and 1600 dogs. The results show 26 races in which ID3 did not place any bet, which indicates that the system was restrained from making illogical choices, unlike humans, who may gamble without logic in the hope of winning more.
5. C4.5 ALGORITHM
At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. This algorithm has a few base cases.
- All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
In pseudocode, the algorithm is:
1. Check for the base cases.
2. For each attribute a, find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
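The normalized information gain (gain ratio) used as the splitting criterion in the pseudocode above can be sketched as follows; the 'Outlook'/'Play' attribute names are illustrative only, and the helper is deliberately simplified (it ignores missing values and continuous attributes).

```python
# Sketch of the gain ratio: information gain divided by the entropy of the split
# itself ("split information"), which penalizes attributes with many distinct values.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def gain_ratio(rows, attribute, target):
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder, split_info = 0.0, 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        weight = len(subset) / total
        remainder += weight * entropy([r[target] for r in subset])
        split_info -= weight * math.log2(weight)
    gain = base - remainder
    return gain / split_info if split_info > 0 else 0.0

# Toy data with made-up attribute names.
rows = [{"Outlook": "sunny", "Play": "no"}, {"Outlook": "rain", "Play": "yes"},
        {"Outlook": "sunny", "Play": "no"}, {"Outlook": "overcast", "Play": "yes"}]
print(gain_ratio(rows, "Outlook", "Play"))
```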
C4.5 made a number of improvements to ID3. Some of these are:
- Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it (a sketch follows this list).
- Handling training data with missing attribute values: C4.5 allows attribute values to be marked as missing. Missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.
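As a sketch of the first improvement in the list above, the following Python function searches candidate thresholds (midpoints between consecutive sorted values) for a continuous attribute and returns the one with the highest information gain; the numeric values and labels are toy data, not from any real experiment.

```python
# Sketch of C4.5-style handling of a continuous attribute: try midpoints between
# consecutive sorted values and keep the threshold with the largest information gain.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate midpoint threshold
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Toy continuous attribute (e.g. a temperature reading) with class labels.
print(best_threshold([64, 65, 68, 69, 70, 71], ["yes", "no", "yes", "yes", "yes", "no"]))
```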
6. CART ALGORITHM
Classification and regression trees (CART) is a non-parametric technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. Trees are formed by a collection of rules based on values of certain variables in the modeling data set.
- Rules are selected based on how well splits on the variables' values can differentiate observations with respect to the dependent variable.
- Once a rule is selected and splits a node into two, the same logic is applied to each "child" node (i.e. it is a recursive procedure).
- Splitting stops when CART detects that no further gain can be made, or some pre-set stopping rules are met.
- Each branch of the tree ends in a terminal node.
- Each observation falls into one and exactly one terminal node.
- Each terminal node is uniquely defined by a set of rules.
The basic idea of tree growing is to choose, at each node, a split among all the possible splits so that the resulting child nodes are the "purest". In this algorithm, only univariate splits are considered; that is, each split depends on the value of only one predictor variable. The set of possible splits consists of all possible splits of each predictor.
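A minimal sketch of this split-selection step, assuming the Gini index as the node-impurity measure used by CART, is shown below; it enumerates the univariate binary splits of each predictor and keeps the one with the largest impurity decrease. The predictor names and rows are invented for illustration.

```python
# Sketch of CART-style split selection with the Gini index: for each predictor and
# each candidate split point, compute the drop in impurity; the largest decrease wins.
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_univariate_split(rows, predictors, target):
    parent = gini([r[target] for r in rows])
    best = (None, None, 0.0)              # (predictor, split point, impurity decrease)
    for p in predictors:
        for value in sorted({r[p] for r in rows}):
            left = [r[target] for r in rows if r[p] <= value]
            right = [r[target] for r in rows if r[p] > value]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            decrease = parent - child
            if decrease > best[2]:
                best = (p, value, decrease)
    return best

# Toy data with made-up predictor names.
rows = [{"age": 25, "income": 20, "risk": "high"},
        {"age": 40, "income": 45, "risk": "low"},
        {"age": 35, "income": 30, "risk": "high"},
        {"age": 50, "income": 60, "risk": "low"}]
print(best_univariate_split(rows, ["age", "income"], "risk"))
```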
7. COMPARISON OF ID3, C4.5 AND CART
Algorithm designers have had much success with greedy, divide-and-conquer approaches to building class descriptions. The decision tree learners made popular by ID3 and C4.5 (Quinlan 1986) and by CART (Breiman, Friedman, Olshen, and Stone 1984) were chosen for this survey because they are relatively fast and typically produce competitive classifiers. In fact, the decision tree generator C4.5, a successor to ID3, has become a standard of comparison in machine learning research, because it produces good classifiers quickly. For non-numeric datasets, the run time of ID3 (and C4.5) grows linearly with the number of examples.
The practical run-time complexity of C4.5 has been determined empirically to be worse than O(e^2) on some datasets, where e is the number of examples. One possible explanation is based on the observation of Oates and Jensen (1998) that the size of C4.5 trees increases linearly with the number of examples. One of the factors in C4.5's run-time complexity corresponds to the tree depth, which cannot be larger than the number of attributes; tree depth is related to tree size, and thereby to the number of examples. When compared with C4.5, the run-time complexity of CART is satisfactory.
The decision-tree algorithm is one of the most effective classification methods, and the data determine the efficiency and correctness rate of the algorithm. This survey examined the decision tree algorithms ID3, C4.5 and CART with respect to their steps for processing data and their run-time complexity. The inductive learning algorithms successfully recognized and generalized the rules contained in the given training data. The accuracies of the algorithms were also very high, which means the systems produced reliable results. This also shows that inductive learning can be successfully applied in a complex problem domain, and is therefore very useful for real-world problems. The second conclusion is that the algorithms were able to learn new rules and therefore to adapt to changes. Finally, it can be concluded that, among the three algorithms, CART performs best in terms of the rules generated and accuracy: CART produced fewer rules yet was more accurate than the other two algorithms. This shows that the CART algorithm is better at induction and rule generalization than the ID3 and C4.5 algorithms.
First, I would like to thank the Almighty for His blessings toward the successful completion of this survey paper. I would like to extend my thanks to my research guide Dr. (Mrs.) M. Punithavalli, Director, Dept. of Computer Science, Sri Rama Krishna College for Women, Coimbatore, for her valuable assistance, help and guidance during the research process. I also extend my gratitude to my husband, Mr. M. S. Raja Sekaran, for his moral support and co-operation.
S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE Trans. on Systems, Man and Cybernetics, 21(3):660-674, 1991.
S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
 R. Kohavi and J. R. Quinlan. Decision-tree discovery. In Will Klosgen and Jan M. Zytkow, editors, Handbook of Data Mining and Knowledge Discovery, chapter 16.1.3, pages 267-276. Oxford University Press, 2002.
S. Grumbach and T. Milo. Towards Tractable Algebras for Bags. Journal of Computer and System Sciences, 52(3):570-588, 1996.
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth Int. Group, 1984.
 J.R. Quinlan, Simplifying decision trees, International Journal of Man-Machine Studies, 27, 221-234, 1987.
T. R. Hancock, T. Jiang, M. Li, and J. Tromp. Lower Bounds on Learning Decision Lists and Trees. Information and Computation, 126(2):114-122, 1996.
L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15-17, 1976.
 H. Zantema and H. L. Bodlaender, Finding Small Equivalent Decision Trees is Hard, International Journal of Foundations of Computer Science, 11(2):343-354, 2000.
 G.E. Naumov. NP-completeness of problems of construction of optimal decision trees. Soviet Physics: Doklady, 36(4):270-271, 1991.
 J.R. Quinlan, Induction of decision trees, Machine Learning 1, 81-106, 1986.