How Decision Tree Algorithm works
Introduction to Decision Tree Algorithm
Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike other supervised learning algorithms, decision tree algorithm can be used for solving regression and classification problems too.
The general motive of using Decision Tree is to create a training model which can use to predict class or value of target variables by learning decision rules inferred from prior data(training data).
The understanding level of Decision Trees algorithm is so easy compared with other classification algorithms. The decision tree algorithm tries to solve the problem, by using tree representation. Each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.
Decision Tree Algorithm Pseudocode
 Place the best attribute of the dataset at the root of the tree.
 Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
 Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
In decision trees, for predicting a class label for a record we start from the root of the tree. We compare the values of the root attribute with record’s attribute. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node.
We continue comparing our record’s attribute values with other internal nodes of the tree until we reach a leaf node with predicted class value. As we know how the modeled decision tree can be used to predict the target class or the value. Now let’s understanding how we can create the decision tree model.
Assumptions while creating Decision Tree
The below are the some of the assumptions we make while using Decision tree:
 At the beginning, the whole training set is considered as the root.
 Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
 Records are distributed recursively on the basis of attribute values.
 Order to placing attributes as root or internal node of the tree is done by using some statistical approach.
Decision Trees follow Sum of Product (SOP) representation. For the above images, you can see how we can predict can we accept the new job offer? and Use computer daily? from traversing for the root node to the leaf node.
It’s a sum of product representation. The Sum of product(SOP) is also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node having the same class is a conjunction(product) of values, different branches ending in that class form a disjunction(sum).
The primary challenge in the decision tree implementation is to identify which attributes do we need to consider as the root node and each level. Handling this is know the attributes selection. We have different attributes selection measure to identify the attribute which can be considered as the root note at each level.
The popular attribute selection measures:
 Information gain
 Gini index
Attributes Selection
If dataset consists of “n” attributes then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step. By just randomly selecting any node to be the root can’t solve the issue. If we follow a random approach, it may give us bad results with low accuracy.
For solving this attribute selection problem, researchers worked and devised some solutions. They suggested using some criterion like information gain, gini index, etc. These criterions will calculate values for every attribute. The values are sorted, and attributes are placed in the tree by following the order i.e, the attribute with a high value(in case of information gain) is placed at the root.
While using information Gain as a criterion, we assume attributes to be categorical, and for gini index, attributes are assumed to be continuous.
Information Gain
By using information gain as a criterion, we try to estimate the information contained by each attribute. We are going to use some points deducted from information theory.
To measure the randomness or uncertainty of a random variable X is defined by Entropy.
For a binary classification problem with only two classes, positive and negative class.
 If all examples are positive or all are negative then entropy will be zero i.e, low.
 If half of the records are of positive class and half are of negative class then entropy is one i.e, high.
By calculating entropy measure of each attribute we can calculate their information gain. Information Gain calculates the expected reduction in entropy due to sorting on the attribute. Information gain can be calculated.
To get a clear understanding of calculating information gain & entropy, we will try to implement it on a sample data.
Example: Construct a Decision Tree by using “information gain” as a criterion
We are going to use this data sample. Let’s try to use information gain as a criterion. Here, we have 5 columns out of which 4 columns have continuous data and 5th column consists of class labels.
A, B, C, D attributes can be considered as predictors and E column class labels can be considered as a target variable. For constructing a decision tree from this data, we have to convert continuous data into categorical data.
We have chosen some random values to categorize each attribute:
A  B  C  D 
>= 5  >= 3.0  >= 4.2  >= 1.4 
< 5  < 3.0  < 4.2  < 1.4 
There are 2 steps for calculating information gain for each attribute:
 Calculate entropy of Target.
 Entropy for every attribute A, B, C, D needs to be calculated. Using information gain formula we will subtract this entropy from the entropy of target. The result is Information Gain.
The entropy of Target: We have 8 records with negative class and 8 records with positive class. So, we can directly estimate the entropy of target as 1.
Variable E  
Positive  Negative 
8  8 
Calculating entropy using formula:
E(8,8) = 1*( (p(+ve)*log( p(+ve)) + (p(ve)*log( p(ve)) )
= 1*( (8/16)*log_{2}(8/16)) + (8/16) * log_{2}(8/16) )
= 1
Information gain for Var A
Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.
 For Var A >= 5 & class == positive: 5/12
 For Var A >= 5 & class == negative: 7/12
 Entropy(5,7) = 1 * ( (5/12)*log2(5/12) + (7/12)*log2(7/12)) = 0.9799
 For Var A <5 & class == positive: 3/4
 For Var A <5 & class == negative: 1/4
 Entropy(3,1) = 1 * ( (3/4)*log2(3/4) + (1/4)*log2(1/4)) = 0.81128
Entropy(Target, A) = P(>=5) * E(5,7) + P(<5) * E(3,1)
= (12/16) * 0.9799 + (4/16) * 0.81128 = 0.937745
$latex \textrm{Information Gain(IG) = E(Target) – E(Target,A) = 1 0.9337745 = 0.062255} $
Information gain for Var B
Var B has value >=3 for 12 records out of 16 and 4 records with value <5 value.
 For Var B >= 3 & class == positive: 8/12
 For Var B >= 3 & class == negative: 4/12
 Entropy(8,4) = 1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12)) = 0.39054
 For VarB <3 & class == positive: 0/4
 For Var B <3 & class == negative: 4/4
 Entropy(0,4) = 1 * ( (0/4)*log2(0/4) + (4/4)*log2(4/4)) = 0
Entropy(Target, B) = P(>=3) * E(8,4) + P(<3) * E(0,4)
= (12/16) * 0.39054 + (4/16) * 0 = 0.292905
$latex \textrm{Information Gain(IG) = E(Target) – E(Target,B) = 1 0.292905= 0.707095} $
Information gain for Var C
Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.
 For Var C >= 4.2 & class == positive: 0/6
 For Var C >= 4.2 & class == negative: 6/6
 Entropy(0,6) = 0
 For VarC < 4.2 & class == positive: 8/10
 For Var C < 4.2 & class == negative: 2/10
 Entropy(8,2) = 0.72193
Entropy(Target, C) = P(>=4.2) * E(0,6) + P(< 4.2) * E(8,2)
= (6/16) * 0 + (10/16) * 0.72193 = 0.4512
$latex \textrm{Information Gain(IG) = E(Target) – E(Target,C) = 1 0.4512= 0.5488} $
Information gain for Var D
Var D has value >=1.4 for 5 records out of 16 and 11 records with value <5 value.
 For Var D >= 1.4 & class == positive: 0/5
 For Var D >= 1.4 & class == negative: 5/5
 Entropy(0,5) = 0
 For Var D < 1.4 & class == positive: 8/11
 For Var D < 14 & class == negative: 3/11
 Entropy(8,3) = 1 * ( (8/11)*log2(8/11) + (3/11)*log2(3/11)) = 0.84532
Entropy(Target, D) = P(>=1.4) * E(0,5) + P(< 1.4) * E(8,3)
= 5/16 * 0 + (11/16) * 0.84532 = 0.5811575
$latex \textrm{Information Gain(IG) = E(Target) – E(Target,D) = 1 0.5811575 = 0.41189} $





From the above Information Gain calculations, we can build a decision tree. We should place the attributes on the tree according to their values.
An Attribute with better value than other should position as root and A branch with entropy 0 should be converted to a leaf node. A branch with entropy more than 0 needs further splitting.
Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified. It means an attribute with lower gini index should be preferred.
Example: Construct a Decision Tree by using “gini index” as a criterion
We are going to use same data sample that we used for information gain example. Let’s try to use gini index as a criterion. Here, we have 5 columns out of which 4 columns have continuous data and 5th column consists of class labels.
A, B, C, D attributes can be considered as predictors and E column class labels can be considered as a target variable. For constructing a decision tree from this data, we have to convert continuous data into categorical data.
We have chosen some random values to categorize each attribute:
A  B  C  D 
>= 5  >= 3.0  >=4.2  >= 1.4 
< 5  < 3.0  < 4.2  < 1.4 
Gini Index for Var A
Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.
 For Var A >= 5 & class == positive: 5/12
 For Var A >= 5 & class == negative: 7/12
 gini(5,7) = 1 ( (5/12)2 + (7/12)2 ) = 0.4860
 For Var A <5 & class == positive: 3/4
 For Var A <5 & class == negative: 1/4
 gini(3,1) = 1 ( (3/4)2 + (1/4)2 ) = 0.375
By adding weight and sum each of the gini indices:
$latex \textrm{gini(Target, A) = (12/16) * (0.486) + (4/16) * (0.375) = 0.45825}$
Gini Index for Var B
Var B has value >=3 for 12 records out of 16 and 4 records with value <5 value.
 For Var B >= 3 & class == positive: 8/12
 For Var B >= 3 & class == negative: 4/12
 gini(8,4) = 1 ( (8/12)2 + (4/12)2 ) = 0.446
 For Var B <3 & class == positive: 0/4
 For Var B <3 & class == negative: 4/4
 gin(0,4) = 1 ( (0/4)2 + (4/4)2 ) = 0
$latex \textrm{gini(Target, B) = (12/16) * 0.446 + (4/16) * 0 = 0.3345}$
Gini Index for Var C
Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.
 For Var C >= 4.2 & class == positive: 0/6
 For Var C >= 4.2 & class == negative: 6/6
 gini(0,6) = 1 ( (0/8)2 + (6/6)2 ) = 0
 For Var C < 4.2& class == positive: 8/10
 For Var C < 4.2 & class == negative: 2/10
 gin(8,2) = 1 ( (8/10)2 + (2/10)2 ) = 0.32
$latex \textrm{gini(Target, C) = (6/16) * 0+ (10/16) * 0.32 = 0.2} $
Gini Index for Var D
Var D has value >=1.4 for 5 records out of 16 and 11 records with value <1.4 value.
 For Var D >= 1.4 & class == positive: 0/5
 For Var D >= 1.4 & class == negative: 5/5
 gini(0,5) = 1 ( (0/5)2 + (5/5)2 ) = 0
 For Var D < 1.4 & class == positive: 8/11
 For Var D < 1.4 & class == negative: 3/11
 gin(8,3) = 1 ( (8/11)2 + (3/11)2 ) = 0.397
$latex \textrm{ gini(Target, D) = (5/16) * 0+ (11/16) * 0.397 = 0.273} $





Overfitting
Overfitting is a practical problem while building a decision tree model. The model is having an issue of overfitting is considered when the algorithm continues to go deeper and deeper in the to reduce the training set error but results with an increased test set error i.e, Accuracy of prediction for our model goes down. It generally happens when it builds many branches due to outliers and irregularities in data.
Two approaches which we can use to avoid overfitting are:
 PrePruning
 PostPruning
PrePruning
In prepruning, it stops the tree construction bit early. It is preferred not to split a node if its goodness measure is below a threshold value. But it’s difficult to choose an appropriate stopping point.
PostPruning
In postpruning first, it goes deeper and deeper in the tree to build a complete tree. If the tree shows the overfitting problem then pruning is done as a postpruning step. We use a crossvalidation data to check the effect of our pruning. Using crossvalidation data, it tests whether expanding a node will make an improvement or not.
If it shows an improvement, then we can continue by expanding that node. But if it shows a reduction in accuracy then it should not be expanded i.e, the node should be converted to a leaf node.
Decision Tree Algorithm Advantages and Disadvantages
Advantages:
 Decision Trees are easy to explain. It results in a set of rules.
 It follows the same approach as humans generally follow while making decisions.
 Interpretation of a complex Decision Tree model can be simplified by its visualizations. Even a naive person can understand logic.
 The Number of hyperparameters to be tuned is almost null.
Disadvantages:
 There is a high probability of overfitting in Decision Tree.
 Generally, it gives low prediction accuracy for a dataset as compared to other machine learning algorithms.
 Information gain in a decision tree with categorical variables gives a biased response for attributes with greater no. of categories.
 Calculations can become complex when there are many class labels.
Follow us:
FACEBOOK QUORA TWITTER GOOGLE+  LINKEDIN REDDIT  FLIPBOARD  MEDIUM  GITHUB
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
Related Courses:
Do check out unlimited data science courses
Title & links  Details  What You Will Learn 
Machine Learning AZ: HandsOn Python & R In Data Science  Course Overall Rating:: 4.6 

Machine Learning: Classification  Course Overall Rating:: 4.7 

Practical Machine Learning 
Course Overall Rating:: 4.4 

Calculation for this entropy will be zero but you have told it will be 1.
I am really confused whether I’m calculating it correctly or not. Kindly let me know
1*( (8/16)*log2(8/16)) + (8/16) * log2(8/16) )
Let’s break down the entropy calculation you’ve provided:
1 * frac{8}{16} log_2 frac{8}{16} + frac{8}{16} log_2 frac{8}{16}
Given:
frac{8}{16} = 0.5
log_2(0.5) = 1
Substitute the values into the equation:
1 * (0.5 * (1) + 0.5 * (1) )
= 1 * ( 0.5 – 0.5 )
= 1 * (1)
= 1
So the calculation you provided is indeed equal to 1.
Hi, thank you so much for your valuable post. Please, could you explain at what basis you decided B<3 as a negative leaf node? I have read your reply in the above comments but still, it's unclear to me. Please, could you help out?
the second question is: what if we suppose at B<3 "positive" is not zero(0) then what would be the leaf node in that case?
Hi Malik,
Question 1: If you clearly observe the information gain calculation in the post, for B<3 case we don’t have any positive class whereas we are having 4 negative cases, that’s the main reason for deciding for case B<3 is negative.
Question 2: For the second question if B<3 is , not zero we can get some values for positive and some values for negative cases then we will try to split the node again if we can’t split the node, then the target class which is having the max value will be treated as the leaf node.
Thanks! and happy learning!
Which Gini index value will be split ? Highest Gini index or lowest Gini index ?
Hi Riya,
Which has highest Gini index will considering for splitting.
Thanks and happy learning.
Which Gini index Value will be split in tree ? Highest Gini index or lowest Gini index ?
Plz help me with this.
Hi Riya,
Which has the highest Gini index will considering for splitting.
Thanks and happy learning.
Hi!
I want to know the difference algorithm between CART, C4.5 and ID3 for implementing them in python.
Hi FSJ,
Thanks for suggesting new topic to write, We will write an article that explains the implementation level difference between CART, C4.5, and ID3.
Happy learning!
Very good explanation, thank you!
But I still have a problem with the root node selection. In the case of all the features with same quantity of different data (say 2 for yes/no), then selection is clearly up to you, but if you have (say 3 for yes/no/maybe) clearly you will select the feature with most entropy first (therefore the one with 3), is this true?
Hi Higinio,
Yes you are correct that’s how the root selection works.
Thanks and happy learning.
Hi Saimadhu,
In both cases, IG and GI, column A is ignored or omitted. is it due to the Threshold that algorithm ignored it as prepruning ?
Hi Vishnu,
They are ignored because the model has not found any importance in keeping as features in the decision tree model.
Thanks and happy learning.
You cannot say that Gini works only for continuous. This is simply not true! A simple example is CART: Gini index is a primary decision maker.
Hi Emin,
Yes, you are correct we can’t say gini works best for continuous data all the time, but compared to the other selection methods works well for the categorical data, gini method is the best to try out for continuous data.
Thanks and happy learning!
Thanks for such nice and precise explanation
Hi Manish bhanu,
Thanks for the compliment.
Happy learning.
Hi,
As per information gain we have slected variable B and as per the Gini index we selected variable C. So how do we decide which metrics to choose to decide on the attribute as root node.
Hi Sashikant Dwivedi,
Selecting which feature selection method to consider is one of the key aspects of machine learning models. We can try different feature selection methods and pick the best selection method which is giving the high accuracy.
As you mentioned that high accuracy is best selection. Is it mean that highest Gini index is best selection ?
Hi Riya,
The highest Gini index is the best for splitting the nodes.
Thanks and happy learning.
1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12)) = 0.39054, this calculation is wrong
You are suppose to get somethign around 0.918
I get the same results !!! 0.918 , this should be corrected !!
Can I have the algorithm for decision tree ?
Yes,
Check out how to implement decision tree in python.
thank you. can you explain how does Decision Tree work with Regression ?
Hi Hieuiph,
Will work on a post to address how to use the decision tree for regression problem.
Hi,
You explained it well. I was able to understand how split criteria work.
I have some doubt:
1) When you have created decision tree using Information gain, then you have taken B as root node as it have highest information gain. My doubt is how you have decided that when B<3.0 class will be negative? why not positive and and how it will be leaf node.
2) Why D is not further split into A node?
3) When D=5 and A=3.0 and B<3. Is there any logic to choose these random value or randomly we just need to choose it. How R will choose it?
Can you please answer these question as it will be really helpful for my understanding.
Thanks & Regards,
Pradeep Tripathi
Hi Pradeep,
The split criteria limits will update based on the training dataset. The random values you were thinking were not random values 🙂 Programmatically they were computed using the training dataset. Let’s say from the above question how we have decided that B < 3.0 class will be negative The answer will be based using the split criteria like information gain or gini index we will get to know the node, then the next step is to find the threshold which determines the target class. To get the threshold value the program will continually be updated by considering the train data target value and by checking the present threshold will the target class is correctly classified, if not the threshold will be updated and again the same process of checking will happen. This will happen until we get good amount target classes was classified correctly using the threshold values at the root node. You don’t need to start coding all these you use the python or r package to compute the above things. You can check the implementing Decision tree in Python post to learn how we can do it programmatically.
Hi,
You explained it well.
I have some doubt:
1) When you have created decision tree using Information gain, then you have taken B as root node as it have highest information gain. My doubt is how you have decided that when B<3.0 class will be negative? why not positive and and how it will be leaf node.
2) Why D is not further split into A node?
3) When D=5 and A=3.0 and B<3. Is there any logic to choose these random value or randomly we just need to choose it. How R will choose it?
Can you please answer these question as it will be really helpful for my understanding.
Thanks & Regards,
Pradeep Tripathi