Building Decision Tree Algorithm in Python with scikit-learn
Decision Tree Algorithm implementation with scikit-learn
One of the most intuitive and best-loved supervised learning algorithms is the decision tree. It can be used for both classification and regression purposes.
In the previous article, how the decision tree algorithm works, we gave a thorough introduction to the workings of the decision tree algorithm. In this article, we are going to build a decision tree classifier in Python using the scikit-learn machine learning package on the balance scale dataset.
In short, this article explains how to implement a decision tree classifier on the balance scale dataset in Python using the sklearn library.
Decision tree algorithm prerequisites
Before you start building the decision tree classifier in Python, please gain enough knowledge of how the decision tree algorithm works. If you don't have a basic understanding of it yet, you can spend some time on the how the decision tree algorithm works article.
Once we have finished modeling the decision tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or stays balanced. The great thing about using sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code.
Before getting started, let's quickly look at the assumptions we make while creating a decision tree and at the decision tree algorithm's pseudocode.
Assumptions we make while using Decision tree
- In the beginning, the whole training set is considered at the root.
- Feature values are preferred to be categorical. If values are continuous, they are discretized prior to building the model (see the short sketch after this list).
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or as internal nodes of the tree is determined using a statistical approach.
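As a hedged illustration of that discretization step, here is a minimal pandas sketch on a made-up continuous column. Note that scikit-learn's decision trees handle continuous features directly by learning thresholds, so this step is optional there.

import pandas as pd

# Made-up continuous values, binned into categorical labels with pandas.cut
ages = pd.Series([3, 17, 25, 42, 61, 78])
age_bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                  labels=['child', 'young', 'middle', 'senior'])
print(age_bins.tolist())
# ['child', 'child', 'young', 'middle', 'middle', 'senior']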
Decision Tree Algorithm Pseudocode
- Place the best attribute of our dataset at the root of the tree.
- Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
- Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree (a minimal Python sketch of this recursion follows below).
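The following is a minimal, self-contained sketch of that pseudocode, using information gain to pick the best attribute. It is an illustrative toy, not scikit-learn's actual implementation.

from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels))
                for c in counts.values())

def best_attribute(rows, labels, attributes):
    # Step 1: pick the attribute with the highest information gain
    base = entropy(labels)
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Leaf node: the subset is pure, or no attributes are left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    # Step 2: split into subsets that share the same value for the chosen attribute
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        # Step 3: repeat on each subset
        tree[attr][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != attr])
    return tree

Calling build_tree on a list of feature dicts and their labels returns a nested dict that maps attribute values to subtrees or leaf labels.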
While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. This tuning should be done carefully, though: overly aggressive tuning can overfit the training data and ultimately build a model that generalizes badly.
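As a hedged sketch of such tuning, scikit-learn's GridSearchCV (the library is introduced below) can search parameter combinations with cross-validation. The grid values here are illustrative assumptions, not recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 7, None],      # shallower trees usually generalize better
    'min_samples_leaf': [1, 5, 10],    # larger leaves reduce overfitting
    'criterion': ['gini', 'entropy'],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=100),
                    param_grid, cv=5, scoring='accuracy')
# grid.fit(X_train, y_train)   # run after the train/test split shown later
# print(grid.best_params_, grid.best_score_)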
Sklearn Library Installation
Python's sklearn library holds tons of modules that help build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms. It is similar to the caret library in R.
Before using it, we first need to install it. The best way to install data science libraries and their dependencies is by installing the Anaconda distribution. You can also install only the most popular machine learning Python libraries.
The sklearn library provides direct access to modules for training our model with different machine learning algorithms, such as the K-nearest neighbors classifier, support vector machine classifier, decision tree, linear regression, etc.
Balance Scale Data Set Description
The Balance Scale dataset consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the class attribute. The target attribute sits in the first column.
| Index | Variable Name | Variable Values |
| --- | --- | --- |
| 1 | Class Name (target variable) | "R": balance scale tips to the right; "L": balance scale tips to the left; "B": balance scale is balanced |
| 2 | Left-Weight | 1, 2, 3, 4, 5 |
| 3 | Left-Distance | 1, 2, 3, 4, 5 |
| 4 | Right-Weight | 1, 2, 3, 4, 5 |
| 5 | Right-Distance | 1, 2, 3, 4, 5 |
The above table shows all the details of the dataset.
Balance Scale Problem Statement
The problem we are going to address: model a classifier that evaluates the direction in which the balance scale tips.
Decision Tree classifier implementation in Python with sklearn Library
The modeled decision tree will compare a new record's feature values against the prior records (training data) that correctly classified the balance scale's tip direction.
Python packages used
- NumPy
- NumPy is a numeric Python module. It provides fast mathematical functions.
- NumPy provides robust data structures for efficient computation over multi-dimensional arrays & matrices.
- We use NumPy to read data files into arrays and for data manipulation.
- Pandas
- Provides DataFrame Object for data manipulation
- Provides reading & writing of data between different file formats.
- DataFrames can hold columns of different data types.
- Scikit-Learn
- It’s a machine learning library. It includes various machine learning algorithms.
- We are using its
- train_test_split,
- DecisionTreeClassifier,
- accuracy_score functions.
If you haven't set up a machine learning environment on your system, the posts below will be helpful.
Importing Python Machine Learning Libraries
This section involves importing all the libraries we are going to use. We are importing numpy, pandas, and sklearn's train_test_split, DecisionTreeClassifier & accuracy_score modules.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
Numpy arrays and pandas dataframes will help us manipulate data. As discussed above, sklearn is a machine learning library. The model_selection module's train_test_split() function will help us split the data into train & test sets (in older sklearn versions this function lived in the now-removed cross_validation module).
The tree module will be used to build the decision tree classifier. The accuracy_score function will be used to calculate accuracy metrics from the predicted class variables.
Data Import
For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we need to download the dataset. You can download the dataset from here. All the data values are separated by commas.
After downloading the data file, we will use pandas' read_csv() method to import the data into a pandas dataframe. Since our data is separated by commas "," and there is no header row, we will set the header parameter to None and the sep parameter to ",".
balance_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
    sep=',', header=None)
We are saving our data into “balance_data” dataframe.
For checking the length & dimensions of our dataframe, we can use the len() function & the .shape attribute.
print "Dataset Lenght:: ", len(balance_data) print "Dataset Shape:: ", balance_data.shape
Output:
Dataset Length::  625
Dataset Shape::  (625, 5)
We can print the head, i.e., the top 5 rows of our dataframe, using the head() method.
print "Dataset:: " balance_data.head()
Output:
Dataset::
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
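Data Slicing

The training code below assumes X_train, X_test, y_train, and y_test already exist, so we reconstruct the slicing step here: separate the feature columns (2nd to 5th) from the target column (1st), then split the records into train and test sets. The test_size=0.3 and random_state=100 values are assumptions, chosen to be consistent with the 188-record test set implied by the accuracy outputs later (73.4042553191% = 138/188).

X = balance_data.values[:, 1:5]   # feature columns: weights and distances
Y = balance_data.values[:, 0]     # target column: L, B or R

# test_size and random_state are assumed values (see the note above)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)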
Decision Tree Classifier with criterion gini index
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
Output:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_samples_leaf=5, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=100, splitter='best')
Decision Tree Classifier with criterion information gain
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)
Output
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3, max_features=None, max_leaf_nodes=None, min_samples_leaf=5, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=100, splitter='best')
Prediction
Now we have modeled 2 classifiers: one with the gini index & another with information gain as the criterion. We are ready to predict classes for our test set using the predict() method. Let's try to predict the target variable for the test set's 1st record.
clf_gini.predict([[4, 4, 3, 3]])
Output
array(['R'], dtype=object)
This way we can predict the class for a single record. It's time to predict the target variable for the whole test dataset.
Prediction for Decision Tree classifier with criterion as gini index
y_pred = clf_gini.predict(X_test)
y_pred
Output
array(['R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'L', 'R', 'R', 'R', 'R', 'R', 'L', 'L', 'R', 'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'R'], dtype=object)
Prediction for Decision Tree classifier with criterion as information gain
y_pred_en = clf_entropy.predict(X_test)
y_pred_en
Output
array(['R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'R', 'L', 'L', 'L', 'R', 'R', 'R'], dtype=object)
Calculating Accuracy Score
The accuracy_score() function will be used to print the accuracy of the decision tree classifiers. By accuracy, we mean the ratio of the correctly predicted data points to all the predicted data points. Accuracy as a metric helps us understand the effectiveness of our algorithm. It takes 4 parameters:
- y_true,
- y_pred,
- normalize,
- sample_weight.
Out of these 4, normalize & sample_weight are optional parameters. The parameter y_true accepts an array of correct labels and y_pred takes an array of predicted labels returned by the classifier. The function returns the accuracy as a float value.
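Here is a tiny self-contained illustration with made-up labels, showing both the default fraction and the normalize=False count:

from sklearn.metrics import accuracy_score

y_true = ['L', 'R', 'B', 'R']   # made-up correct labels
y_pred = ['L', 'R', 'R', 'R']   # made-up predictions, 3 of 4 correct

print(accuracy_score(y_true, y_pred))                   # 0.75, the fraction correct
print(accuracy_score(y_true, y_pred, normalize=False))  # 3, the count of correct predictions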
Accuracy for Decision Tree classifier with criterion as gini index
print "Accuracy is ", accuracy_score(y_test,y_pred)*100
Output
Accuracy is 73.4042553191
Accuracy for Decision Tree classifier with criterion as information gain
print "Accuracy is ", accuracy_score(y_test,y_pred_en)*100
Output
Accuracy is 70.7446808511
Conclusion
In this article, we learned how to model the decision tree algorithm in Python using the scikit-learn machine learning library. In the process, we learned how to split the data into train and test datasets. To model the decision tree classifier, we used the information gain and gini index split criteria. In the end, we calculated the accuracy of these two decision tree models.
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
This is a great blog post! I learned a lot from it.
Thank you for the positive feedback! I’m delighted to hear that you found the post informative. If there’s any other topic you’re curious about or if you have further questions, please don’t hesitate to ask. Always here to help and share knowledge! 😊
Thanks for the interesting topic. Very comprehensive!
How can I reverse-engineer this example, i.e., change the target variable and consequently get the feature variables to adjust themselves? Basically, I want to set a target variable value and get all possible values of the feature variables that can yield that target variable value.
Hi Mike Sishi,
Thanks for the compliment.
We wish you a very happy learning.
Hi,
Thanks for the tutorial.
I'd like to ask you a question about an error I get when running this part of the code:
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
I get this error:
ValueError: setting an array element with a sequence.
Can you help?
Thanks
Hi Fana,
I guess you are passing a Series object as X_train. Could you please check X_train before passing it into the clf_gini.fit(X_train, y_train) method.
Thanks and happy learning!
Nice article, and it shows you the basics of the decision tree algorithm, but the resulting accuracy (70% or 73%) is extremely bad by any real standard. You can very easily get better results, with 90+% accuracy, using even the simplest neural network instead of the sub-optimal decision tree described here.
Hi, how can I use a decision tree on an XML file where the words and features are in different files? If you want, I can share the files with you.
Hi Eklil,
I would suggest first converting the XML to a proper CSV or Excel file; then you don't need to worry about anything else and can focus only on the modeling part.
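A minimal sketch of that conversion, assuming a hypothetical records.xml whose <record> elements hold the features (the file name and tag are illustrative assumptions):

import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse('records.xml').getroot()             # hypothetical input file
rows = [{child.tag: child.text for child in record}  # one dict per <record> element
        for record in root.iter('record')]           # 'record' tag is an assumption
pd.DataFrame(rows).to_csv('records.csv', index=False)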
Thanks and happy learning.
Hi Sai
Does it make any difference if the attribute values are continuous?
Hi Eli,
It's a great intuition to think about continuous and categorical variables while modeling machine learning models. When it comes to the decision tree, it matters a lot.
At each node level, we use a feature value (categorical or continuous) to calculate the cutoff value, and the final decision depends on that cutoff value.
In the case of categorical variables we end up with only a few candidate splits, whereas in the case of a continuous variable we have a large set of candidate values.
You write: the "X" set consists of predictor variables and contains the data from the 2nd column to the 5th column. Then shouldn't the code be X = balance_data.values[:, 1:4] instead of X = balance_data.values[:, 1:5], assuming column 0 is the outcome data?
Hi Jeet Kumar,
That's a great question. As you said, we need to use the data from the 2nd column to the 5th column. The way to select this is by using 1:5, because Python slicing excludes the end index, so 1:5 selects columns 1 through 4. You can check the Python range function to get a much clearer idea about it.
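A quick illustration of that slicing behavior on a small made-up array:

import numpy as np

a = np.arange(10).reshape(2, 5)
print(a[:, 1:5])   # selects columns 1, 2, 3, 4 -- the end index 5 is excluded
# [[1 2 3 4]
#  [6 7 8 9]]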
Hi, thank you very much for your tutorial, it is very interesting. I am a computer science student and my project is about network security. I want to build a decision tree model to classify the different types of attacks in the KDD dataset. I already have the training dataset and the test dataset, so my problem is how to build the model. Can you please help me by sending the Python code? Thank you.
Hi Ehui Bernard,
Thanks for your compliment 🙂
Sending you the Python code won't help you much. Just try to read the different articles on how to implement different machine learning models and try to implement them on your own. At the beginning it seems difficult, but believe me, once you learn how to build the model by yourself, it will help you a lot in your career.
Freaking awesome content, thanks for this 🙂
Hi Tushar,
Thanks for your compliment 🙂
Hi,
I am trying to train a model where
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100)
But I get an error of this type
Traceback (most recent call last):
File "Bankloan.py3", line 18, in
clf_gini.fit(X_train, y_train)
File "/usr/local/lib/python3.5/dist-packages/sklearn/tree/tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/lib/python3.5/dist-packages/sklearn/tree/tree.py", line 140, in fit
check_classification_targets(y)
File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'
Kindly help
Hi Anshul Agrawal,
Please check the target class label; the error is about the target class. Before you start modeling the decision tree, print the targets and the features so you can avoid these kinds of errors.
Let me know if you face any issue.
Hi,
As per the example, if I predict for [5, 5, 5, 5], it should be B, right, as the correct answer is B. But I am getting "R", and I did not understand why. Is that the reason it has a low accuracy score of 73%?
Hi Dheeraj,
Yes, you are correct. When the model you trained has high accuracy, it will predict the correct answer in most cases. Here the accuracy of the trained model is only 73%, which may be the reason you are getting R instead of B.
So please update the training with more relevant features or other techniques to increase the model accuracy, and do share the techniques you used to increase the accuracy. We would love to hear from you.
I've used Spyder 3.1.2 (Python 3.6) but the visualization code didn't work. It's not throwing any error, just no decision tree appears! Any help appreciated.
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=100, splitter='best')
Hi Avik Sadhu,
To visualize the decision tree we need to use export_graphviz. If you face any difficulty using export_graphviz, let me know and I will write an article on how to visualize the decision tree.
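In the meantime, here is a minimal sketch, assuming the clf_gini model trained earlier in this article and the Graphviz dot tool installed on your system:

from sklearn.tree import export_graphviz

export_graphviz(clf_gini, out_file='balance_tree.dot',
                feature_names=['Left-Weight', 'Left-Distance',
                               'Right-Weight', 'Right-Distance'],
                class_names=['B', 'L', 'R'])   # alphabetical, matching clf_gini.classes_
# Then render from the command line:
#   dot -Tpng balance_tree.dot -o balance_tree.png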
Hi Avik Sadhu,
This article on visualizing the trained decision tree model will help you out.