How Decision Tree Algorithm works

January 30, 2017 Rahul Saxena

Decision Tree Algorithm

Introduction to Decision Tree Algorithm

Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike other supervised learning algorithms, decision tree algorithm can be used for solving regression and classification problems too.

The general motive of using Decision Tree is to create a training model which can use to predict class or value of target variables by learning decision rules inferred from prior data(training data).

The understanding level of Decision Trees algorithm is so easy compared with other classification algorithms. The decision tree algorithm tries to solve the problem, by using tree representation. Each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.

Decision Tree Algorithm Pseudocode

Place the best attribute of the dataset at the root of the tree.
Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.

Decision Tree classifier, Image credit: www.packtpub.com

In decision trees, for predicting a class label for a record we start from the root of the tree. We compare the values of the root attribute with record’s attribute. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node.

We continue comparing our record’s attribute values with other internal nodes of the tree until we reach a leaf node with predicted class value. As we know how the modeled decision tree can be used to predict the target class or the value. Now let’s understanding how we can create the decision tree model.

Assumptions while creating Decision Tree

The below are the some of the assumptions we make while using Decision tree:

At the beginning, the whole training set is considered as the root.
Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
Records are distributed recursively on the basis of attribute values.
Order to placing attributes as root or internal node of the tree is done by using some statistical approach.

Decision tree model example Image Credit: http://zhanpengfang.github.io/

Decision Trees follow Sum of Product (SOP) representation. For the above images, you can see how we can predict can we accept the new job offer? and Use computer daily? from traversing for the root node to the leaf node.

It’s a sum of product representation. The Sum of product(SOP) is also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node having the same class is a conjunction(product) of values, different branches ending in that class form a disjunction(sum).

The primary challenge in the decision tree implementation is to identify which attributes do we need to consider as the root node and each level. Handling this is know the attributes selection. We have different attributes selection measure to identify the attribute which can be considered as the root note at each level.

The popular attribute selection measures:

Information gain
Gini index

Attributes Selection

If dataset consists of “n” attributes then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step. By just randomly selecting any node to be the root can’t solve the issue. If we follow a random approach, it may give us bad results with low accuracy.

For solving this attribute selection problem, researchers worked and devised some solutions. They suggested using some criterion like information gain, gini index, etc. These criterions will calculate values for every attribute. The values are sorted, and attributes are placed in the tree by following the order i.e, the attribute with a high value(in case of information gain) is placed at the root.

While using information Gain as a criterion, we assume attributes to be categorical, and for gini index, attributes are assumed to be continuous.

Information Gain

By using information gain as a criterion, we try to estimate the information contained by each attribute. We are going to use some points deducted from information theory.
To measure the randomness or uncertainty of a random variable X is defined by Entropy.

For a binary classification problem with only two classes, positive and negative class.

If all examples are positive or all are negative then entropy will be zero i.e, low.
If half of the records are of positive class and half are of negative class then entropy is one i.e, high.

$H(X)=\mathbb {E} _{X}[I(x)]=-\sum _{x\in \mathbb {X} }p(x)\log p(x).$

By calculating entropy measure of each attribute we can calculate their information gain. Information Gain calculates the expected reduction in entropy due to sorting on the attribute. Information gain can be calculated.

To get a clear understanding of calculating information gain & entropy, we will try to implement it on a sample data.

Example: Construct a Decision Tree by using “information gain” as a criterion

We are going to use this data sample. Let’s try to use information gain as a criterion. Here, we have 5 columns out of which 4 columns have continuous data and 5th column consists of class labels.

A, B, C, D attributes can be considered as predictors and E column class labels can be considered as a target variable. For constructing a decision tree from this data, we have to convert continuous data into categorical data.

We have chosen some random values to categorize each attribute:

A	B	C	D
>= 5	>= 3.0	>= 4.2	>= 1.4
< 5	< 3.0	< 4.2	< 1.4

There are 2 steps for calculating information gain for each attribute:

Calculate entropy of Target.
Entropy for every attribute A, B, C, D needs to be calculated. Using information gain formula we will subtract this entropy from the entropy of target. The result is Information Gain.

The entropy of Target: We have 8 records with negative class and 8 records with positive class. So, we can directly estimate the entropy of target as 1.

Variable E
Positive	Negative
8	8

Calculating entropy using formula:

E(8,8) = -1*( (p(+ve)*log( p(+ve)) + (p(-ve)*log( p(-ve)) )
= -1*( (8/16)*log₂(8/16)) + (8/16) * log₂(8/16) )
= 1

Information gain for Var A

Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.

For Var A >= 5 & class == positive: 5/12
For Var A >= 5 & class == negative: 7/12
- Entropy(5,7) = -1 * ( (5/12)*log2(5/12) + (7/12)*log2(7/12)) = 0.9799
For Var A <5 & class == positive: 3/4
For Var A <5 & class == negative: 1/4
- Entropy(3,1) = -1 * ( (3/4)*log2(3/4) + (1/4)*log2(1/4)) = 0.81128

Entropy(Target, A) = P(>=5) * E(5,7) + P(<5) * E(3,1)
= (12/16) * 0.9799 + (4/16) * 0.81128 = 0.937745

$latex \textrm{Information Gain(IG) = E(Target) – E(Target,A) = 1- 0.9337745 = 0.062255} $

Information gain for Var B

Var B has value >=3 for 12 records out of 16 and 4 records with value <5 value.

For Var B >= 3 & class == positive: 8/12
For Var B >= 3 & class == negative: 4/12
- Entropy(8,4) = -1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12)) = 0.39054
For VarB <3 & class == positive: 0/4
For Var B <3 & class == negative: 4/4
- Entropy(0,4) = -1 * ( (0/4)*log2(0/4) + (4/4)*log2(4/4)) = 0

Entropy(Target, B) = P(>=3) * E(8,4) + P(<3) * E(0,4)
= (12/16) * 0.39054 + (4/16) * 0 = 0.292905

$latex \textrm{Information Gain(IG) = E(Target) – E(Target,B) = 1- 0.292905= 0.707095} $

Information gain for Var C

Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.

For Var C >= 4.2 & class == positive: 0/6
For Var C >= 4.2 & class == negative: 6/6
- Entropy(0,6) = 0
For VarC < 4.2 & class == positive: 8/10
For Var C < 4.2 & class == negative: 2/10
- Entropy(8,2) = 0.72193

Entropy(Target, C) = P(>=4.2) * E(0,6) + P(< 4.2) * E(8,2)
= (6/16) * 0 + (10/16) * 0.72193 = 0.4512

$latex \textrm{Information Gain(IG) = E(Target) – E(Target,C) = 1- 0.4512= 0.5488} $

Information gain for Var D

Var D has value >=1.4 for 5 records out of 16 and 11 records with value <5 value.

For Var D >= 1.4 & class == positive: 0/5
For Var D >= 1.4 & class == negative: 5/5
- Entropy(0,5) = 0
For Var D < 1.4 & class == positive: 8/11
For Var D < 14 & class == negative: 3/11
- Entropy(8,3) = -1 * ( (8/11)*log2(8/11) + (3/11)*log2(3/11)) = 0.84532

Entropy(Target, D) = P(>=1.4) * E(0,5) + P(< 1.4) * E(8,3)
= 5/16 * 0 + (11/16) * 0.84532 = 0.5811575

$latex \textrm{Information Gain(IG) = E(Target) – E(Target,D) = 1- 0.5811575 = 0.41189} $

		Target
		Positive	Negative
A	>= 5.0	5	7
	<5	3	1
Information Gain of A = 0.062255

		Target
		Positive	Negative
B	>= 3.0	8	4
	< 3.0	0	4
Information Gain of B= 0.7070795

		Target
		Positive	Negative
C	>= 4.2	0	6
	< 4.2	8	2
Information Gain of C= 0.5488

		Target
		Positive	Negative
D	>= 1.4	0	5
	< 1.4	8	3
Information Gain of D= 0.41189

From the above Information Gain calculations, we can build a decision tree. We should place the attributes on the tree according to their values.

An Attribute with better value than other should position as root and A branch with entropy 0 should be converted to a leaf node. A branch with entropy more than 0 needs further splitting.

Gini Index

Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified. It means an attribute with lower gini index should be preferred.

Example: Construct a Decision Tree by using “gini index” as a criterion

We are going to use same data sample that we used for information gain example. Let’s try to use gini index as a criterion. Here, we have 5 columns out of which 4 columns have continuous data and 5th column consists of class labels.

We have chosen some random values to categorize each attribute:

A	B	C	D
>= 5	>= 3.0	>=4.2	>= 1.4
< 5	< 3.0	< 4.2	< 1.4

Gini Index for Var A

Var A has value >=5 for 12 records out of 16 and 4 records with value <5 value.

For Var A >= 5 & class == positive: 5/12
For Var A >= 5 & class == negative: 7/12
- gini(5,7) = 1- ( (5/12)2 + (7/12)2 ) = 0.4860
For Var A <5 & class == positive: 3/4
For Var A <5 & class == negative: 1/4
- gini(3,1) = 1- ( (3/4)2 + (1/4)2 ) = 0.375

By adding weight and sum each of the gini indices:

$latex \textrm{gini(Target, A) = (12/16) * (0.486) + (4/16) * (0.375) = 0.45825}$

Gini Index for Var B

Var B has value >=3 for 12 records out of 16 and 4 records with value <5 value.

For Var B >= 3 & class == positive: 8/12
For Var B >= 3 & class == negative: 4/12
- gini(8,4) = 1- ( (8/12)2 + (4/12)2 ) = 0.446
For Var B <3 & class == positive: 0/4
For Var B <3 & class == negative: 4/4
- gin(0,4) = 1- ( (0/4)2 + (4/4)2 ) = 0

$latex \textrm{gini(Target, B) = (12/16) * 0.446 + (4/16) * 0 = 0.3345}$

Gini Index for Var C

Var C has value >=4.2 for 6 records out of 16 and 10 records with value <4.2 value.

For Var C >= 4.2 & class == positive: 0/6
For Var C >= 4.2 & class == negative: 6/6
- gini(0,6) = 1- ( (0/8)2 + (6/6)2 ) = 0
For Var C < 4.2& class == positive: 8/10
For Var C < 4.2 & class == negative: 2/10
- gin(8,2) = 1- ( (8/10)2 + (2/10)2 ) = 0.32

$latex \textrm{gini(Target, C) = (6/16) * 0+ (10/16) * 0.32 = 0.2} $

Gini Index for Var D

Var D has value >=1.4 for 5 records out of 16 and 11 records with value <1.4 value.

For Var D >= 1.4 & class == positive: 0/5
For Var D >= 1.4 & class == negative: 5/5
- gini(0,5) = 1- ( (0/5)2 + (5/5)2 ) = 0
For Var D < 1.4 & class == positive: 8/11
For Var D < 1.4 & class == negative: 3/11
- gin(8,3) = 1- ( (8/11)2 + (3/11)2 ) = 0.397

$latex \textrm{ gini(Target, D) = (5/16) * 0+ (11/16) * 0.397 = 0.273} $

		wTarget
		Positive	Negative
A	>= 5.0	5	7
	<5	3	1
Ginin Index of A = 0.45825

		Target
		Positive	Negative
B	>= 3.0	8	4
	< 3.0	0	4
Gini Index of B= 0.3345

		Target
		Positive	Negative
C	>= 4.2	0	6
	< 4.2	8	2
Gini Index of C= 0.2

		Target
		Positive	Negative
D	>= 1.4	0	5
	< 1.4	8	3
Gini Index of D= 0.273

Overfitting

Overfitting is a practical problem while building a decision tree model. The model is having an issue of overfitting is considered when the algorithm continues to go deeper and deeper in the to reduce the training set error but results with an increased test set error i.e, Accuracy of prediction for our model goes down. It generally happens when it builds many branches due to outliers and irregularities in data.

Two approaches which we can use to avoid overfitting are:

Pre-Pruning
Post-Pruning

Pre-Pruning

In pre-pruning, it stops the tree construction bit early. It is preferred not to split a node if its goodness measure is below a threshold value. But it’s difficult to choose an appropriate stopping point.

Post-Pruning

In post-pruning first, it goes deeper and deeper in the tree to build a complete tree. If the tree shows the overfitting problem then pruning is done as a post-pruning step. We use a cross-validation data to check the effect of our pruning. Using cross-validation data, it tests whether expanding a node will make an improvement or not.

If it shows an improvement, then we can continue by expanding that node. But if it shows a reduction in accuracy then it should not be expanded i.e, the node should be converted to a leaf node.

Decision Tree Algorithm Advantages and Disadvantages

Advantages:

Decision Trees are easy to explain. It results in a set of rules.
It follows the same approach as humans generally follow while making decisions.
Interpretation of a complex Decision Tree model can be simplified by its visualizations. Even a naive person can understand logic.
The Number of hyper-parameters to be tuned is almost null.

Disadvantages:

There is a high probability of overfitting in Decision Tree.
Generally, it gives low prediction accuracy for a dataset as compared to other machine learning algorithms.
Information gain in a decision tree with categorical variables gives a biased response for attributes with greater no. of categories.
Calculations can become complex when there are many class labels.

Related Courses:

Do check out unlimited data science courses

Title & links	Details	What You Will Learn
Machine Learning A-Z: Hands-On Python & R In Data Science	Students Enrolled:: 19,359 Course Overall Rating:: 4.6	Master Machine Learning on Python & R Make robust Machine Learning models. Handle specific topics like Reinforcement Learning, NLP and Deep Learning. Build an army of powerful Machine Learning models and know how to combine them to solve any problem.
Machine Learning: Classification	Course Overall Rating:: 4.7	Describe the input and output of a classification model. Build a classification model to predict sentiment in a product review dataset. Analyze financial data to predict loan defaults. Evaluate your models using precision-recall metrics. Implement these techniques in Python (or in the language of your choice, though Python is highly recommended).
Practical Machine Learning	Course Overall Rating:: 4.4	This course will cover the basic components of building and applying prediction functions with an emphasis on practical applications. Will provide the basic grounding in concepts such as training and tests sets, overfitting, and error rates and introduce a range of model-based and algorithmic machine learning methods These methods including regression, classification trees, Naive Bayes, and random forests. The course will cover the complete process of building prediction functions including data collection, feature creation, algorithms, and evaluation.

31 Responses to “How Decision Tree Algorithm works”

Kaushal Kishor
11 months ago
Reply

Calculation for this entropy will be zero but you have told it will be 1.
I am really confused whether I’m calculating it correctly or not. Kindly let me know

-1*( (8/16)*log2(8/16)) + (8/16) * log2(8/16) )
- Saimadhu Polamuri
  6 months ago
  Reply
  
  Let’s break down the entropy calculation you’ve provided:
  
  -1 * frac{8}{16} log_2 frac{8}{16} + frac{8}{16} log_2 frac{8}{16}
  
  Given:
  frac{8}{16} = 0.5
  log_2(0.5) = -1
  
  Substitute the values into the equation:
  -1 * (0.5 * (-1) + 0.5 * (-1) )
  = -1 * ( -0.5 – 0.5 )
  = -1 * (-1)
  = 1
  
  So the calculation you provided is indeed equal to 1.
Malik
5 years ago
Reply

Hi, thank you so much for your valuable post. Please, could you explain at what basis you decided B<3 as a negative leaf node? I have read your reply in the above comments but still, it's unclear to me. Please, could you help out?
the second question is: what if we suppose at B<3 "positive" is not zero(0) then what would be the leaf node in that case?
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Malik,
  
  Question 1: If you clearly observe the information gain calculation in the post, for B<3 case we don’t have any positive class whereas we are having 4 negative cases, that’s the main reason for deciding for case B<3 is negative.
  
  Question 2: For the second question if B<3 is , not zero we can get some values for positive and some values for negative cases then we will try to split the node again if we can’t split the node, then the target class which is having the max value will be treated as the leaf node.
  
  Thanks! and happy learning!
Riya
5 years ago
Reply

Which Gini index value will be split ? Highest Gini index or lowest Gini index ?
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Riya,
  
  Which has highest Gini index will considering for splitting.
  
  Thanks and happy learning.
Riya
5 years ago
Reply

Which Gini index Value will be split in tree ? Highest Gini index or lowest Gini index ?
Plz help me with this.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Riya,
  
  Which has the highest Gini index will considering for splitting.
  
  Thanks and happy learning.
FSJ
5 years ago
Reply

Hi!
I want to know the difference algorithm between CART, C4.5 and ID3 for implementing them in python.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi FSJ,
  
  Thanks for suggesting new topic to write, We will write an article that explains the implementation level difference between CART, C4.5, and ID3.
  
  Happy learning!
Higinio
5 years ago
Reply

Very good explanation, thank you!
But I still have a problem with the root node selection. In the case of all the features with same quantity of different data (say 2 for yes/no), then selection is clearly up to you, but if you have (say 3 for yes/no/maybe) clearly you will select the feature with most entropy first (therefore the one with 3), is this true?
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Higinio,
  
  Yes you are correct that’s how the root selection works.
  
  Thanks and happy learning.
Vishnu
5 years ago
Reply

Hi Saimadhu,

In both cases, IG and GI, column A is ignored or omitted. is it due to the Threshold that algorithm ignored it as pre-pruning ?
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Vishnu,
  
  They are ignored because the model has not found any importance in keeping as features in the decision tree model.
  
  Thanks and happy learning.
Emin
5 years ago
Reply

You cannot say that Gini works only for continuous. This is simply not true! A simple example is CART: Gini index is a primary decision maker.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Emin,
  
  Yes, you are correct we can’t say gini works best for continuous data all the time, but compared to the other selection methods works well for the categorical data, gini method is the best to try out for continuous data.
  
  Thanks and happy learning!
manish bhanu
5 years ago
Reply

Thanks for such nice and precise explanation
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Manish bhanu,
  
  Thanks for the compliment.
  
  Happy learning.
sashikant dwivedi
6 years ago
Reply

Hi,

As per information gain we have slected variable B and as per the Gini index we selected variable C. So how do we decide which metrics to choose to decide on the attribute as root node.
- Saimadhu Polamuri
  6 years ago
  Reply
  
  Hi Sashikant Dwivedi,
  
  Selecting which feature selection method to consider is one of the key aspects of machine learning models. We can try different feature selection methods and pick the best selection method which is giving the high accuracy.
  - Riya
    5 years ago
    Reply
    
    As you mentioned that high accuracy is best selection. Is it mean that highest Gini index is best selection ?
    - Saimadhu Polamuri
      4 years ago
      Reply
      
      Hi Riya,
      
      The highest Gini index is the best for splitting the nodes.
      
      Thanks and happy learning.
Angel
7 years ago
Reply

-1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12)) = 0.39054, this calculation is wrong

You are suppose to get somethign around 0.918
- Christhian
  5 years ago
  Reply
  
  I get the same results !!! 0.918 , this should be corrected !!
Niteesha
7 years ago
Reply

Can I have the algorithm for decision tree ?
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Yes,
  Check out how to implement decision tree in python.
hieuiph
7 years ago
Reply

thank you. can you explain how does Decision Tree work with Regression ?
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Hieuiph,
  Will work on a post to address how to use the decision tree for regression problem.
Pradeep Tripathi
7 years ago
Reply

Hi,

You explained it well. I was able to understand how split criteria work.
I have some doubt:-

1) When you have created decision tree using Information gain, then you have taken B as root node as it have highest information gain. My doubt is how you have decided that when B<3.0 class will be negative? why not positive and and how it will be leaf node.
2) Why D is not further split into A node?
3) When D=5 and A=3.0 and B<3. Is there any logic to choose these random value or randomly we just need to choose it. How R will choose it?

Can you please answer these question as it will be really helpful for my understanding.

Thanks & Regards,
Pradeep Tripathi
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Pradeep,
  
  The split criteria limits will update based on the training dataset. The random values you were thinking were not random values 🙂 Programmatically they were computed using the training dataset. Let’s say from the above question how we have decided that B < 3.0 class will be negative The answer will be based using the split criteria like information gain or gini index we will get to know the node, then the next step is to find the threshold which determines the target class. To get the threshold value the program will continually be updated by considering the train data target value and by checking the present threshold will the target class is correctly classified, if not the threshold will be updated and again the same process of checking will happen. This will happen until we get good amount target classes was classified correctly using the threshold values at the root node. You don’t need to start coding all these you use the python or r package to compute the above things. You can check the implementing Decision tree in Python post to learn how we can do it programmatically.
Pradeep Tripathi
7 years ago
Reply

Hi,

You explained it well.
I have some doubt:-

1) When you have created decision tree using Information gain, then you have taken B as root node as it have highest information gain. My doubt is how you have decided that when B<3.0 class will be negative? why not positive and and how it will be leaf node.
2) Why D is not further split into A node?
3) When D=5 and A=3.0 and B<3. Is there any logic to choose these random value or randomly we just need to choose it. How R will choose it?

Can you please answer these question as it will be really helpful for my understanding.

Thanks & Regards,
Pradeep Tripathi