Building Decision Tree Algorithm in Python with scikit learn

One of the most intuitive and beginner-friendly supervised learning algorithms is the Decision Tree algorithm. It can be used for both classification and regression tasks.

In the previous article, how the decision tree algorithm works, we gave a thorough introduction to the workings of the decision tree algorithm. In this article, we are going to build a decision tree classifier in Python using the scikit-learn machine learning package, on the balance scale dataset.

In short, this article explains how to implement a Decision Tree classifier on the balance scale dataset. We will program our classifier in Python using the sklearn library.


Decision tree algorithm prerequisites

Before you start building the decision tree classifier in Python, please gain enough knowledge of how the decision tree algorithm works. If you don’t have a basic understanding of the algorithm, you can spend some time on the how the Decision Tree Algorithm works article.

Once we have finished modeling the Decision Tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or stays balanced. The great thing about using sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code.

Before we get started, let’s quickly look at the assumptions we make while creating the decision tree, and at the decision tree algorithm pseudocode.

Assumptions we make while using Decision tree

  • In the beginning, the whole training set is considered at the root.
  • Feature values are preferred to be categorical. If values are continuous then they are discretized prior to building the model.
  • Records are distributed recursively on the basis of attribute values.
  • The order in which attributes are placed as the root or as internal nodes of the tree is determined using a statistical approach.

Decision Tree Algorithm Pseudocode

  1. Place the best attribute of our dataset at the root of the tree.
  2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
  3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
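
To make the pseudocode concrete, here is a rough from-scratch sketch. This is purely illustrative and is not the article’s code, nor what sklearn does internally; it assumes categorical features, rows given as tuples, and attributes given as a set of column indices:

from collections import Counter

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_i^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    # Leaf: node is pure or no attributes remain -> return majority class
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: pick the attribute whose split gives the lowest weighted impurity
    def split_score(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attrs, key=split_score)
    # Step 2: split the records into subsets, one per value of the best attribute
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[best], ([], []))
        subsets[row[best]][0].append(row)
        subsets[row[best]][1].append(y)
    # Step 3: repeat on each subset until every branch ends in a leaf
    return {best: {value: build_tree(r, l, attrs - {best})
                   for value, (r, l) in subsets.items()}}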

While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. But this tuning should be done carefully: if we overdo it, the algorithm can overfit the training data, and we will ultimately end up with a model that generalizes poorly. The short sketch below illustrates the effect.
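
Here is a small self-contained sketch (on synthetic data, not the balance scale dataset) comparing train and test accuracy as max_depth grows; an unlimited depth typically drives training accuracy to 100% while test accuracy stalls or even drops:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (2, 5, None):   # None lets the tree grow until all leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te))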

Sklearn Library Installation

Python’s sklearn library holds tons of modules that help to build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms. It is similar to the caret library in R.

To use it, we first need to install it. The easiest way to install the data science libraries and their dependencies is by installing the Anaconda package. You can also install only the most popular machine learning Python libraries.
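
If you already have a working Python installation, a lighter route (a sketch assuming pip is available on your system) is to install just the three libraries used in this tutorial:

pip install numpy pandas scikit-learn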

The sklearn library gives us direct access to different modules for training our model with different machine learning algorithms, such as the K-nearest neighbors classifier, the support vector machine classifier, decision trees, linear regression, etc.

Balance Scale Data Set Description

The Balance Scale dataset consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the class attribute, which is the first column of the dataset.

Index  Variable Name                  Variable Values
1      Class Name (target variable)   "R": balance scale tips to the right
                                      "L": balance scale tips to the left
                                      "B": balance scale is balanced
2      Left-Weight                    1, 2, 3, 4, 5
3      Left-Distance                  1, 2, 3, 4, 5
4      Right-Weight                   1, 2, 3, 4, 5
5      Right-Distance                 1, 2, 3, 4, 5

The above table shows all the details of the data.

Balance Scale Problem Statement

The problem we are going to address: model a classifier that evaluates the direction in which the balance scale tips.

Decision Tree classifier implementation in Python with sklearn Library

The modeled Decision Tree will compare a new record’s measurements with the prior records (the training data) that correctly classified the balance scale’s tip direction.

Python packages used

  • NumPy
    • NumPy is a numeric Python module that provides fast mathematical functions.
    • NumPy provides robust data structures for efficient computation of multi-dimensional arrays and matrices.
    • We use it to read data files into NumPy arrays and for data manipulation.
  • Pandas
    • Provides the DataFrame object for data manipulation.
    • Provides reading and writing of data between different file formats.
    • DataFrames can hold different types of data in a tabular structure.
  • Scikit-learn
    • It’s a machine learning library that includes various machine learning algorithms.
    • We are using its
      • train_test_split,
      • DecisionTreeClassifier, and
      • accuracy_score modules.

If you haven’t set up a machine learning environment on your system yet, the posts below will be helpful.

Python machine learning setup in Ubuntu

Python machine learning virtual environment setup

Importing Python Machine Learning Libraries

This section involves importing all the libraries we are going to use: numpy, pandas, and sklearn’s train_test_split, DecisionTreeClassifier, and accuracy_score modules.

import numpy as np
import pandas as pd
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

Numpy arrays and pandas dataframes will help us in manipulating data. As discussed above, sklearn is a machine learning library. The train_test_split() method from the model_selection module (it lived in the cross_validation module in scikit-learn versions before 0.18) will help us split the data into train and test sets.

The tree module will be used to build the Decision Tree classifier, and the accuracy_score function will be used to calculate the accuracy metric from the predicted class labels.

Data Import

For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we need to download the dataset. You can download the dataset from here.  All the data values are separated by commas.

After downloading the data file, we will use pandas’ read_csv() method to import the data into a pandas dataframe. Since our data is separated by commas and there is no header row, we will set the header parameter to None and the sep parameter to ",".

balance_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
    sep=',', header=None)

We are saving our data into the “balance_data” dataframe.

For checking the length and dimensions of our dataframe, we can use the len() method and the .shape attribute.

print "Dataset Lenght:: ", len(balance_data)
print "Dataset Shape:: ", balance_data.shape

Output:

Dataset Length::  625
Dataset Shape::  (625, 5)

We can print the head, i.e., the top 5 rows of our dataframe, using the head() method.

print "Dataset:: "
balance_data.head()
Dataset:: 

Output:

   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5

Data Slicing

Data slicing is the step of splitting the data into train and test sets. The training dataset is used specifically for building our model. The test dataset must not be mixed in while building the model; even during standardization, the test set should not be used to fit the scaler.

X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]

The above snippet divides the data into a feature set and a target set. The “X” set consists of the predictor variables: the data from the 2nd column to the 5th column. The “Y” set consists of the outcome variable: the data in the 1st column. We use “.values” to convert our dataframe slices into numpy arrays.

Let’s split our data into training and test set. We will use sklearn’s train_test_split() method.

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)

The above snippet splits the data into training and test sets: X_train and y_train are the training data, while X_test and y_test are the test data.

The parameter test_size is given the value 0.3, which means the test set will be 30% of the whole dataset and the training set 70%. The random_state parameter seeds the pseudo-random number generator used for the random sampling; if you want to replicate our results, use the same value of random_state.

Decision Tree Training

Now we fit the decision tree algorithm on the training data, predict labels for the test dataset, and print the accuracy of models built with various parameters.

DecisionTreeClassifier(): This is the classifier class for decision trees. It is the main class for implementing the algorithm. Some important parameters are:

  • criterion: It defines the function used to measure the quality of a split. Sklearn supports “gini” for the Gini index and “entropy” for information gain; see the short sketch after this list. By default, it takes the value “gini”.
  • splitter: It defines the strategy used to choose the split at each node. It supports “best” to choose the best split and “random” to choose the best random split. By default, it takes the value “best”.
  • max_features: It defines the number of features to consider when looking for the best split. We can pass an integer, a float, a string, or None.
    • If an integer is passed, it considers that many features at each split.
    • If a float is passed, it is the fraction of features considered at each split.
    • If “auto” or “sqrt” is passed, then max_features=sqrt(n_features).
    • If “log2” is passed, then max_features=log2(n_features).
    • If None, then max_features=n_features. By default, it takes the value None.
  • max_depth: The maximum depth of the tree. It can take any integer value or None. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
  • min_samples_split: The minimum number of samples required to split an internal node. If an integer is passed, it is used as the minimum count; if a float, it is interpreted as a fraction. By default, it takes the value 2.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is passed, it is used as the minimum count; if a float, it is interpreted as a fraction. By default, it takes the value 1.
  • max_leaf_nodes: It defines the maximum number of leaf nodes. If None, the number of leaf nodes is unlimited. By default, it takes the value None.
  • min_impurity_split: It defines a threshold for early stopping of tree growth. A node will split if its impurity is above the threshold; otherwise it becomes a leaf. (Note that this parameter is deprecated in recent scikit-learn versions.)
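
To make the two criteria concrete, here is a toy computation (an aside, not from the original post) of the Gini impurity and entropy for a node whose records are spread over three classes:

import numpy as np

p = np.array([0.5, 0.3, 0.2])        # class proportions at a node, e.g. R/L/B
gini = 1.0 - np.sum(p ** 2)          # Gini impurity: 1 - sum(p_i^2)   -> 0.62
entropy = -np.sum(p * np.log2(p))    # entropy: -sum(p_i * log2(p_i))  -> ~1.485
print(gini, entropy)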

Let’s build classifiers using the gini index and information gain criteria. We need to fit each classifier using fit(). We will also plot a visualization of the fitted decision tree (see the sketch after the two classifiers below).

Decision Tree Classifier with criterion gini index

clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

Output:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

Decision Tree Classifier with criterion information gain

clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

Output

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')
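
The tree module we imported earlier can render the fitted tree. Below is a minimal visualization sketch; it assumes matplotlib is installed and scikit-learn 0.21 or newer, where tree.plot_tree() is available:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
tree.plot_tree(clf_gini,
               feature_names=['Left-Weight', 'Left-Distance',
                              'Right-Weight', 'Right-Distance'],
               class_names=clf_gini.classes_,   # 'B', 'L', 'R'
               filled=True)                     # color nodes by majority class
plt.show()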


Prediction

Now we have modeled two classifiers: one with the gini index and another with information gain as the criterion. We are ready to predict classes for our test set using the predict() method. Let’s try to predict the target variable for a single record, for example [4, 4, 3, 3].

clf_gini.predict([[4, 4, 3, 3]])

Output

array(['R'], dtype=object)

This way we can predict the class for a single record. It’s time to predict the target variable for the whole test dataset.

Prediction for Decision Tree classifier with criterion as gini index

y_pred = clf_gini.predict(X_test)
y_pred

Output

array(['R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L',
       'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
       'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'L',
       'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R',
       'L', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'L', 'L',
       'L', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
       'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'R',
       'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
       'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
       'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R',
       'L', 'R', 'R', 'R', 'R', 'R', 'L', 'L', 'R', 'R', 'R', 'R', 'L',
       'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R',
       'R', 'L', 'L', 'R', 'R', 'R'], dtype=object)

Prediction for Decision Tree classifier with criterion as information gain

y_pred_en = clf_entropy.predict(X_test)
y_pred_en

Output

array(['R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'L',
       'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
       'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
       'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'R',
       'L', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R',
       'L', 'L', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
       'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'R',
       'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
       'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
       'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'R',
       'L', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L',
       'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R', 'R',
       'L', 'L', 'L', 'R', 'R', 'R'], dtype=object)

Calculating Accuracy Score

The function accuracy_score() will be used to compute the accuracy of the decision tree classifiers. By accuracy, we mean the ratio of correctly predicted data points to all predicted data points. Accuracy as a metric helps us understand the effectiveness of our algorithm. It takes 4 parameters:

  • y_true,
  • y_pred,
  • normalize,
  • sample_weight.

Out of these 4, normalize and sample_weight are optional. The parameter y_true accepts an array of correct labels and y_pred takes an array of predicted labels returned by the classifier. The function returns the accuracy as a float value.
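
As an aside (not part of the original tutorial), passing normalize=False makes accuracy_score() return the raw count of correctly classified records instead of a fraction:

accuracy_score(y_test, y_pred, normalize=False)   # count of correct predictions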

Accuracy for Decision Tree classifier with criterion as gini index

print "Accuracy is ", accuracy_score(y_test,y_pred)*100

Output

Accuracy is  73.4042553191

Accuracy for Decision Tree classifier with criterion as information gain

print "Accuracy is ", accuracy_score(y_test,y_pred_en)*100

Output

Accuracy is  70.7446808511

Conclusion

In this article, we learned how to model the decision tree algorithm in Python using the scikit-learn machine learning library. In the process, we learned how to split the data into train and test datasets. To model the decision tree classifier, we used the information gain and gini index split criteria. In the end, we calculated the accuracy of these two decision tree models.


I hope you liked this post. If you have any questions, feel free to comment below. If you want me to write on one particular topic, let me know in the comments below.

Related Courses:

Do check out unlimited data science courses.

  • Machine Learning: Classification
    • Learn the basic introduction to classification.
    • Introduces the most popular classification algorithms.
    • You will get a chance to implement them in Python using the Python machine learning libraries.
  • Data Mining with Python: Classification and Regression
    • Understand the key concepts in data mining and learn how to apply them to solve real-world problems.
    • Get hands-on experience with the Python programming language.
    • Get hands-on experience with the numpy, pandas, and matplotlib libraries.
  • Machine Learning with Scikit-learn
    • Load data into scikit-learn and run many machine learning algorithms, both supervised and unsupervised.
    • Assess model accuracy and performance.
    • Be able to decide on the best model for every scenario.

24 Responses to “Building Decision Tree Algorithm in Python with scikit learn”

  • Softbigs
    7 months ago

    This is a great blog post! I learned a lot from it.

    • Thank you for the positive feedback! I’m delighted to hear that you found the post informative. If there’s any other topic you’re curious about or if you have further questions, please don’t hesitate to ask. Always here to help and share knowledge! 😊

  • Mike Sishi
    5 years ago

    Thanks for interesting topic. Very comprehensive!

    How can I reverse-engineer this example? I.e., change the target variable and consequently get the feature variables to adjust themselves. Basically, I want to set a target variable value and get all possible values of the feature variables that can yield that target variable value.

  • Hi,
    thanks for the tutorial.
    I’d like to ask you a question about an error I get when running this part of the code:

    clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)

    I get this error:
    ValueError: setting an array element with a sequence.

    Can you help?
    tx

    • Hi Fana,

      I guess you are passing a Series object as X_train. Could you please check X_train before passing it into the clf_gini.fit(X_train, y_train) method?

      Thanks and happy learning!

  • Nice article, and it shows you the basics of the decision tree algorithm, but the resulting accuracy (70% or 73%) is extremely bad by any real standard. You can very easily get better results with 90+% accuracy using even the simplest neural network instead of the sub-optimal decision tree described here.

  • Hi, how can I use a decision tree on an XML file where the words and features are in different files? If you want, I can share the files with you.

    • Hi Eklil,

      I would suggest first converting the XML to a proper CSV or Excel file; then you don’t need to worry about anything and can focus only on the modeling part.

      Thanks and happy learning.

  • Hi Sai

    Does it make any difference if the attribute values are continuous?

    • Hi Eli,

      It is a great intuition to think about continuous and categorical variables while modeling machine learning models. When it comes to decision trees, it matters a lot.

      At each node, we use the feature value (categorical or continuous) to calculate the cutoff value, and the final decision depends on that cutoff value.

      In the case of categorical variables, we end up with only a few candidate splits, whereas in the case of continuous variables we have a large set of possible cutoff values.

  • Jeet Kumar
    6 years ago

    You write: the “X ” set consists of predictor variables. It consists of data from 2nd column to 5th column. , then shouldn’t the code be: X = balance_data.values[:, 1:4], instead of X = balance_data.values[:, 1:5]? Assuming column 0 is the outcome data.

    • Hi Jeet Kumar,

      It was a good question. As you said, we need to use the data from the 2nd column to the 5th column. The way to select this is by using 1:5, as Python treats the slice 1:5 as columns 1 through 4. You can check the Python range function to get a clearer idea about it.

  • ehui bernard
    6 years ago

    Hi, thank you very much for your tutorial, it is very interesting. I am a computer science student and my project is about network security. I want to build a decision tree model to classify the different types of attacks in the KDD dataset. I already have the training dataset and the test dataset, so my problem is how to build the model. Can you help me, please, by sending the Python code? Thank you.

    • Hi Ehui Bernard,

      Thanks for your compliment 🙂

      Sending you the Python code won’t help you at all. Just try to read different articles on how to implement different machine learning models and try to implement them on your own. At the beginning it seems difficult, but believe me, once you have learned how to build the model by yourself, it will help you a lot in your career.

  • Freaking awesome content, thanks for this 🙂

  • Anshul Agrawal
    6 years ago

    Hi,
    I am trying to train a model where
    clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100)
    But I get an error of this type

    Traceback (most recent call last):
    File "Bankloan.py3", line 18, in
    clf_gini.fit(X_train, y_train)
    File "/usr/local/lib/python3.5/dist-packages/sklearn/tree/tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
    File "/usr/local/lib/python3.5/dist-packages/sklearn/tree/tree.py", line 140, in fit
    check_classification_targets(y)
    File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
    ValueError: Unknown label type: 'unknown'

    Kindly help

    • Hi, Anshul Agrawal,

      Please check the target class label. The error is about the target class. Before you start modeling the decision tree, print the targets and the features so you can avoid these kinds of errors.

      Let me know if you face any issue.

  • Dheeraj
    7 years ago

    Hi,
    As per the example, if I predict for [5,5,5,5], it should be B, right, as the correct answer is B? But I am getting “R”. I did not understand why. Is that the reason it has a low accuracy score of 73?

    • Hi Dheeraj,
      Yes, you are correct. When a trained model has high accuracy, it will predict the correct answer in most cases. Here you have mentioned that the accuracy of the trained model is 73%; that may be the reason you are predicting R instead of B.

      So please update the training with more relevant features or other techniques to increase the model accuracy, and do share the techniques you used. We would love to hear from you.

  • Avik Sadhu
    7 years ago

    I’ve used Spyder 3.1.2 (Python 3.6) but the visualization code didn’t work. It’s not throwing any error, only no decision tree appears! Any help appreciated.

    DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
    min_samples_split=2, min_weight_fraction_leaf=0.0,
    presort=False, random_state=100, splitter='best')
