What is a Decision Tree?
A Decision Tree is a type of supervised learning algorithm that can be used for both classification and regression tasks. It works by creating a model that predicts the value of a target variable based on several inputs.
The decision tree is built in a hierarchical structure, with each internal node representing a “test” on an attribute, each branch representing the outcome of that test, and each leaf node representing a class label or a value. The topmost node in the tree is known as the root node.
Image Source: IBM
The decision tree algorithm works by analyzing the training data to construct a decision tree. When making predictions, the algorithm traverses the tree from the root node down to a leaf node.
At each internal node, it looks at the attribute that was used to split the training data. Based on the value of the attribute, it follows the corresponding branch. Once it reaches a leaf node, it returns the class label or value associated with that node.
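This traversal can be sketched in a few lines of Python. The nested-dict encoding below (`{attribute: {value: subtree_or_leaf}}`) and the example tree are illustrative assumptions, not part of the original text:

```python
# A minimal sketch of decision-tree prediction, assuming the tree is stored
# as nested dicts of the hypothetical form {attribute: {value: subtree_or_leaf}}.
def predict(tree, sample):
    """Traverse from the root down to a leaf and return its label."""
    # A leaf is anything that is not a dict: return its class label (or value).
    if not isinstance(tree, dict):
        return tree
    # Each internal node tests exactly one attribute.
    attribute = next(iter(tree))
    branch = tree[attribute][sample[attribute]]  # follow the matching branch
    return predict(branch, sample)

# Illustrative hand-built tree for the question "play tennis?"
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}}}}

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```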
Decision trees are commonly used in operations research, data mining, and machine learning. They are among the most popular algorithms due to their simplicity and ease of use, and they can handle both numerical and categorical data.
In addition, some implementations can handle missing values, which makes decision trees robust to noisy data. Furthermore, they produce interpretable models, making it easy to understand the underlying relationships between variables.
Types of Decision Trees
Hunt’s algorithm, which was developed in the 1960s to model human learning in psychology, forms the foundation of many popular decision tree algorithms, such as the following:
– ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits. Quinlan published research on this algorithm in 1986.
– C4.5: This algorithm is considered a later iteration of ID3 and was also developed by Quinlan. It can use information gain or gain ratio to evaluate split points within decision trees.
– CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen data point would be misclassified if it were labeled according to the class distribution of the node. When evaluating splits using Gini impurity, a lower value is better.
Important Terminology related to Decision Trees
- Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf / Terminal Node: A node that does not split further.
- Pruning: Removing the sub-nodes of a decision node; it can be seen as the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree.
- Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
Decision trees classify the examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal node providing the classification of the example.
Each node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new node.
Assumptions while creating Decision Tree
Below are some of the assumptions made when building a decision tree:
- At the beginning, the whole training set is considered the root.
- Feature values are preferably categorical. If the values are continuous, they are discretized prior to building the model.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed at the root or at internal nodes of the tree is determined using a statistical measure.
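For the discretization step, one common strategy (assumed here for illustration) is to sort the distinct values of a continuous feature and take the midpoints between neighbors as candidate binary split thresholds:

```python
# Sketch: turning a continuous feature into binary "value <= t" tests by
# generating candidate thresholds at midpoints between consecutive distinct values.
def candidate_thresholds(values):
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Illustrative temperature readings
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(candidate_thresholds(temps))
```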
Decision trees follow a Sum of Product (SOP) representation, also known as Disjunctive Normal Form (DNF). For a given class, every path from the root of the tree to a leaf node of that class is a conjunction (product) of attribute values, and the different paths ending in that class form a disjunction (sum).
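The SOP view can be made concrete by reading each root-to-leaf path off as one conjunction. The sketch below assumes a hypothetical nested-dict tree encoding (`{attribute: {value: subtree_or_leaf}}`); the tree itself is illustrative:

```python
def extract_rules(tree, path=()):
    """Collect each root-to-leaf path as (conditions, label)."""
    if not isinstance(tree, dict):          # leaf: one complete conjunction
        return [(path, tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(extract_rules(subtree, path + ((attribute, value),)))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": "Yes"}}

# Each printed line is one conjunction; the lines sharing a label form
# that label's disjunction (its DNF).
for conditions, label in extract_rules(tree):
    print(" AND ".join(f"{a}={v}" for a, v in conditions), "->", label)
```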
The primary challenge in implementing a decision tree is identifying which attribute to use at the root node and at each subsequent level. This is known as attribute selection. There are different attribute selection measures for identifying the attribute to split on at each level.
How to choose the best attribute at each node
While there are multiple ways to select the best attribute at each node, two methods, information gain and Gini impurity, act as popular splitting criteria for decision tree models. They help to evaluate the quality of each test condition and how well it will be able to classify samples into a class.
Entropy and Information Gain
It’s difficult to explain information gain without first discussing entropy. Entropy is a concept that stems from information theory, and it measures the impurity of the sample values. It is defined by the following formula:

Entropy(S) = -Σc∈C p(c) log2 p(c)

where:
- S represents the data set for which entropy is calculated
- c represents the classes in set S
- p(c) represents the proportion of data points that belong to class c out of the total number of data points in set S
For a two-class problem, entropy values fall between 0 and 1. If all samples in data set S belong to one class, then entropy equals zero. If half of the samples are classified as one class and the other half as another class, entropy is at its highest, at 1. In order to select the best feature to split on and find the optimal decision tree, the attribute producing the smallest amount of entropy should be used.

Information gain represents the difference in entropy before and after a split on a given attribute. The attribute with the highest information gain produces the best split, as it does the best job of classifying the training data according to its target classification. Information gain is usually represented with the following formula, where:
Gain(S, a) = Entropy(S) - Σv∈Values(a) (|Sv| / |S|) · Entropy(Sv)
- a represents a specific attribute
- Entropy(S) is the entropy of dataset S
- Sv is the subset of S in which attribute a takes value v
- |Sv| / |S| represents the proportion of the values in Sv to the number of values in dataset S
- Entropy(Sv) is the entropy of dataset Sv
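A direct translation of these two formulas into code may help. In this sketch the data is assumed to be a list of dicts, and the names (`rows`, `target`) are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes c of p(c) * log2(p(c))."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, a) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    labels = [row[target] for row in rows]
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Entropy of a 9-"Yes" / 5-"No" class split:
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 2))  # 0.94
```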
Let’s walk through an example to solidify these concepts, using the classic “Play Tennis” dataset (Quinlan/Mitchell):

| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|-----|----------|-------------|----------|--------|-------------|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
For this dataset, the entropy is 0.94. This can be calculated by finding the proportion of days where “Play Tennis” is “Yes”, which is 9/14, and the proportion of days where “Play Tennis” is “No”, which is 5/14. Then, these values can be plugged into the entropy formula above.
Entropy(Tennis) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
We can then compute the information gain for each of the attributes individually. For example, the information gain for the attribute “Humidity” would be the following:
Gain(Tennis, Humidity) = 0.94 - (7/14)*(0.985) - (7/14)*(0.592) = 0.151
As a recap:
– 7/14 represents the proportion of values where humidity equals “high” to the total number of humidity values. In this case, the number of values where humidity equals “high” is the same as the number of values where humidity equals “normal”.
– 0.985 is the entropy when Humidity = “high”
– 0.592 is the entropy when Humidity = “normal”
Then, repeat the calculation of information gain for each attribute in the table above, and select the attribute with the highest information gain to be the first split point in the decision tree. In this case, Outlook produces the highest information gain. From there, the process is repeated for each subtree.
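Repeating this computation programmatically over the full 14-day dataset confirms the choice. This is a self-contained sketch with the data embedded; note that with unrounded intermediate entropies, Humidity’s gain comes out to ≈0.152 rather than the 0.151 obtained from the rounded values above:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute):
    labels = [row["Play"] for row in rows]
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [row["Play"] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# The classic 14-day "Play Tennis" data (Quinlan/Mitchell), encoded as
# (Outlook, Temperature, Humidity, Wind, Play).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
columns = ("Outlook", "Temperature", "Humidity", "Wind", "Play")
rows = [dict(zip(columns, record)) for record in data]

for attr in columns[:-1]:
    print(attr, round(information_gain(rows, attr), 3))
# Outlook has the highest gain, so it becomes the root split.
```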
Gini impurity is the probability of incorrectly classifying a random data point in the dataset if it were labeled based on the class distribution of the dataset. Similar to entropy, if set S is pure (i.e., all points belong to one class), then its impurity is zero. It is denoted by the following formula:
Gini(S) = 1 - Σc∈C p(c)²
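Translated directly into code (a sketch; the function name is illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini(S) = 1 - sum of p(c)^2 over the classes c in S."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(["Yes"] * 4))    # 0.0 -- a pure set has zero impurity
print(gini_impurity(["Yes", "No"]))  # 0.5 -- maximally mixed, two classes
print(round(gini_impurity(["Yes"] * 9 + ["No"] * 5), 3))
```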
Advantages and disadvantages of Decision Trees
Advantages of Decision Trees:
1. Easy to Understand and Interpret: Decision tree models mirror a sequence of human-readable if-then decisions and can be visualized directly, making them well suited for interpretation by humans.
2. Less Data Cleaning Required: Decision trees are relatively robust to outliers, and some implementations can handle missing values directly, so they typically require less data cleaning than other algorithms before a model can be built.
3. Fast and Accurate Predictions: A well-constructed decision tree can produce accurate predictions quickly without needing much computing power or memory, and it does not require preprocessing such as feature scaling, which algorithms like SVM and kNN depend on.
4. Automatically Handles Nonlinear Relationships between Variables: The hierarchical structure of decision trees allows them to capture both linear and nonlinear relationships between variables automatically, whereas many machine learning techniques require manually engineered features to capture these interactions.
5. Low Bias, High Variance: Due to their flexible hierarchical structure, fully grown decision trees naturally tend toward low bias (they can fit the training data closely) but high variance (results may differ significantly based on small variations in the training data, and deep trees are prone to overfitting).
Disadvantages of Decision Trees:
1. Complex Decision Trees May Overfit: If a decision tree is too complex, it may overfit the data and lead to inaccurate predictions when presented with new data.
2. Smooth and Oblique Relationships are Difficult to Learn: Because each split tests a single feature, decision trees approximate smooth or diagonal decision boundaries with many axis-aligned steps, which can lead to inaccurate results or missed opportunities for better prediction accuracy.
3. Unstable Results on Minor Changes in the Data Set: Minor changes in feature values can dramatically change the structure of the tree, leading to different outcomes each time the model is retrained on a slightly different dataset.
4. Limited with Continuous Target Variables: Regression trees predict a constant value within each leaf, so their piecewise-constant output approximates continuous targets only coarsely and cannot extrapolate beyond the training range; decision trees are often a more natural fit for categorical targets with discrete values.
5. Time Consumption: Learning a tree can take more time than algorithms like kNN or linear regression, since it involves forming partitioning rules at each node and recursively splitting until the entire sample space has been divided into homogeneous groups.