Decision Tree Classifier implementation in R
The decision tree classifier is a supervised learning algorithm that can be used for both classification and regression tasks. We explained the building blocks of the decision tree algorithm in our earlier articles; now we are going to implement a decision tree classifier in R using the caret machine learning package.
To get more out of this article, it helps to know the decision tree algorithm. If you don't have a basic understanding of the decision tree classifier, it's worth spending some time on how the algorithm works.
Why use the Caret Package
To work on big datasets, we can use machine learning packages directly. The R developer community has built a great package, caret, to make our work easier. The beauty of such packages is that they are well optimized and handle many edge cases, so we just need to call the right functions with the right parameters to implement an algorithm.
Caret Package Installation
The R machine learning caret package (Classification And REgression Training) holds tons of functions that help build predictive models. It provides tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms. It is similar to the scikit-learn library in Python.

To use it, we first need to install it. Open the R console and install it with the command below:
install.packages("caret")
Cars Evaluation Data Set Description
The Cars Evaluation data set consists of 7 attributes: 6 feature attributes and 1 target attribute. All attributes are categorical. We will build a classifier to predict the class attribute, which is the 7th column.
# | Attribute | Values
1 | buying | vhigh, high, med, low
2 | maint | vhigh, high, med, low
3 | doors | 2, 3, 4, 5more
4 | persons | 2, 4, more
5 | lug_boot | small, med, big
6 | safety | low, med, high
7 | class (target variable) | unacc, acc, good, vgood

The above table shows the details of the data attributes.
Car Evaluation Problem Statement:
To model a classifier that evaluates the acceptability of a car from its features.
Decision Tree classifier implementation in R with Caret Package
R Library import
To implement a decision tree in R, we need to import the caret and rpart.plot packages. As mentioned above, caret helps perform the various tasks of our machine learning work. The rpart.plot package helps us get a visual plot of the decision tree.
library(caret)
library(rpart.plot)
In case you face an error while running the code, first install the rpart.plot package with the command install.packages("rpart.plot").
Data Import
For importing and manipulating the data, we are going to use data frames. First of all, we need to download the dataset. You can download it from here; all the data values are comma-separated. After downloading the data file, either set your working directory via the console or save the file in your current working directory.

You can get the path of your current working directory by running getwd() in the R console. If you wish to change it, setwd(<path of new working directory>) does the job.
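As a quick sketch of those two commands (the directory path below is a hypothetical placeholder, not one from this tutorial):

```r
# Print the current working directory
getwd()

# Point R at the folder where car.data will be saved
# ("~/projects/car-eval" is a made-up path; substitute your own)
setwd("~/projects/car-eval")
```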
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
download.file(url = data_url, destfile = "car.data")
car_df <- read.csv("car.data", sep = ",", header = FALSE)
To import data into an R data frame, we can use the read.csv() method with the file name and a flag indicating whether the first row of the dataset is a header. If a header row exists, set header to TRUE; otherwise set it to FALSE.

To check the structure of the data frame, we can call str() on car_df:
> str(car_df)
'data.frame':	1728 obs. of  7 variables:
 $ V1: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ V2: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ V3: Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
 $ V5: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
 $ V6: Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
 $ V7: Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
The above output shows us that our dataset consists of 1728 observations each with 7 attributes.
To check the first rows of the dataset, we can use head(), which returns the first six rows by default.
> head(car_df)
     V1    V2 V3 V4    V5   V6    V7
1 vhigh vhigh  2  2 small  low unacc
2 vhigh vhigh  2  2 small  med unacc
3 vhigh vhigh  2  2 small high unacc
4 vhigh vhigh  2  2   med  low unacc
5 vhigh vhigh  2  2   med  med unacc
6 vhigh vhigh  2  2   med high unacc
All the features are categorical, so normalization of data is not needed.
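One caveat worth noting: since R 4.0, read.csv() defaults to stringsAsFactors = FALSE, so on newer R versions the columns come in as character rather than factor, and the str() output above will differ. A minimal sketch to force factors and verify that every column is categorical:

```r
# On R >= 4.0, pass stringsAsFactors = TRUE explicitly so the
# categorical columns are read as factors (the pre-4.0 default)
car_df <- read.csv("car.data", sep = ",", header = FALSE,
                   stringsAsFactors = TRUE)

# Check that all seven columns V1..V7 are factors
sapply(car_df, is.factor)
```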
Data Slicing
Data slicing is the step that splits the data into a train set and a test set. The training set is used for model building; the test set must not be mixed in while building the model. Even during standardization, the test set should not be used to compute the scaling parameters.
set.seed(3033)
intrain <- createDataPartition(y = car_df$V7, p = 0.7, list = FALSE)
training <- car_df[intrain,]
testing  <- car_df[-intrain,]
The set.seed() method is used to make our work replicable: the partitioning is random, but if our readers pass the same value to set.seed(), they will get an identical split and identical results.

The caret package provides the createDataPartition() method for partitioning data into train and test sets. We pass it 3 parameters. The y parameter takes the variable according to which the data is to be partitioned; in our case the target variable is V7, so we pass car_df$V7 (the car data frame's V7 column).

The p parameter holds a decimal value in the range 0 to 1 and sets the split percentage. We use p = 0.7, meaning a 70:30 split. The list parameter controls whether a list or a matrix is returned; we pass FALSE so a matrix comes back. The createDataPartition() method returns a matrix, intrain, of record indices.
By indexing with intrain, we split the data into training and testing sets. The line training <- car_df[intrain,] puts the selected rows into the training data frame; the remaining rows are saved in the testing data frame via testing <- car_df[-intrain,].
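Because createDataPartition() samples within each class, the class proportions in the two sets should closely match. A quick sanity check, assuming the training and testing data frames created above:

```r
# Class proportions should be nearly identical in both sets,
# since the partition is stratified on the target V7
round(prop.table(table(training$V7)), 3)
round(prop.table(table(testing$V7)), 3)
```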
For checking the dimensions of our training data frame and testing data frame, we can use these:
# check dimensions of train & test set
dim(training)
dim(testing)
Preprocessing & Training
Preprocessing means correcting problems in the data before building a machine learning model on it. Problems can be of many types: missing values, attributes with different ranges, and so on.

To check whether our data contains missing values, we can use the anyNA() method (NA means Not Available).
> anyNA(car_df)
[1] FALSE
Since it returns FALSE, we don't have any missing values.
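Had anyNA() returned TRUE, a per-column count would show where the gaps are. A small sketch of that check:

```r
# Count missing values in each attribute; for this dataset
# every column should report zero, matching anyNA()
colSums(is.na(car_df))
```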
Dataset summarized details
To check the summarized details of our data, we can use the summary() method. It gives a basic idea of each attribute's distribution.
> summary(car_df)
     V1          V2           V3          V4          V5          V6          V7      
 high :432   high :432   2    :432   2   :576   big  :576   high:576   acc  : 384  
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576   good :  69  
 med  :432   med  :432   4    :432   more:576   small:576   med :576   unacc:1210  
 vhigh:432   vhigh:432   5more:432                                     vgood:  65  
Training the Decision Tree classifier with criterion as information gain
The caret package provides the train() method for training our data with various algorithms; we just pass different parameter values for different algorithms. Before calling train(), we first use the trainControl() method, which controls the computational nuances of train().
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(V7 ~., data = training, method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)
We set 3 parameters of trainControl(). The method parameter holds the resampling method and accepts values such as "boot", "boot632", "cv", "repeatedcv", "LOOCV", and "LGOCV". For this tutorial, let's use "repeatedcv", i.e., repeated cross-validation.

The number parameter holds the number of resampling iterations (folds), and the repeats parameter the number of complete sets of folds to compute. We set number = 10 and repeats = 3. trainControl() returns a list, which we pass on to train().
Before training our decision tree classifier, we call set.seed() again so the resampling is reproducible.

To train a decision tree classifier, train() should be passed method = "rpart". There is another package, rpart, which specifically implements decision trees; caret links its train() function to it to keep our work simple.

We pass our target variable as V7. The formula V7 ~ . means: use all other attributes as predictors and V7 as the target. The trControl parameter takes the result of our trainControl() call.
You can check the rpart documentation by typing ?rpart. We can use different criteria when splitting the nodes of the tree.

To select a specific strategy, we pass the parms parameter to train(); it contains a list of parameters for the rpart method. For the splitting criterion, we add a split entry with the value "information" for information gain or "gini" for the Gini index. In the snippet above, we use information gain.
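For reference, the same splitting criterion can be passed to rpart directly, without caret's tuning loop. A minimal sketch (the fixed cp value here is an arbitrary illustration, not a tuned one):

```r
library(rpart)

# One rpart fit with information gain as the split criterion;
# cp is fixed by hand instead of being tuned by train()
fit_info <- rpart(V7 ~ ., data = training,
                  method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(cp = 0.01))
```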
Trained Decision Tree classifier results
We can check the result of train() by printing the dtree_fit variable. It shows accuracy metrics for different values of cp, the complexity parameter of our decision tree.
> dtree_fit
CART

1212 samples
   6 predictor
   4 classes: 'acc', 'good', 'unacc', 'vgood'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1092, 1091, 1091, ...
Resampling results across tuning parameters:

  cp          Accuracy   Kappa
  0.01123596  0.8600447  0.6992474
  0.01404494  0.8487633  0.6710345
  0.01896067  0.8309266  0.6307181
  0.01966292  0.8295492  0.6284956
  0.02247191  0.8130381  0.5930024
  0.02387640  0.8116674  0.5904830
  0.05337079  0.7772599  0.5472383
  0.06179775  0.7745300  0.5470675
  0.07584270  0.7467212  0.3945498
  0.08426966  0.7202717  0.1922830

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01123596.
Plot Decision Tree
We can visualize our decision tree using the prp() method from rpart.plot.
prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
The decision tree visualization above shows the tree's structure, i.e., the order in which attributes were selected under the information-gain criterion.
Prediction
Now our model is trained with cp = 0.01123596 and we are ready to predict classes for the test set using the predict() method. Let's first predict the target variable for the test set's 1st record.
> testing[1,]
     V1    V2 V3 V4    V5  V6    V7
2 vhigh vhigh  2  2 small med unacc
> predict(dtree_fit, newdata = testing[1,])
[1] unacc
Levels: acc good unacc vgood
For the 1st record of the testing data, the classifier predicts the class "unacc", matching the actual label. Now it's time to predict the target variable for the whole test set.
> test_pred <- predict(dtree_fit, newdata = testing)
> confusionMatrix(test_pred, testing$V7)  # check accuracy
Confusion Matrix and Statistics

          Reference
Prediction acc good unacc vgood
     acc   102   19    36     3
     good    6    4     0     3
     unacc   5    0   318     0
     vgood  11    1     0     8

Overall Statistics

               Accuracy : 0.8372
                 95% CI : (0.8025, 0.868)
    No Information Rate : 0.686
    P-Value [Acc > NIR] : 3.262e-15

                  Kappa : 0.6703
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: acc Class: good Class: unacc Class: vgood
Sensitivity              0.8226    0.166667       0.8983      0.57143
Specificity              0.8520    0.981707       0.9691      0.97610
Pos Pred Value           0.6375    0.307692       0.9845      0.40000
Neg Pred Value           0.9382    0.960239       0.8135      0.98790
Prevalence               0.2403    0.046512       0.6860      0.02713
Detection Rate           0.1977    0.007752       0.6163      0.01550
Detection Prevalence     0.3101    0.025194       0.6260      0.03876
Balanced Accuracy        0.8373    0.574187       0.9337      0.77376
The above results show that the classifier with information gain as the splitting criterion gives 83.72% accuracy on the test set.
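The overall accuracy can also be recomputed directly from the prediction vector, which is a handy cross-check on the confusionMatrix() output:

```r
# Fraction of test records predicted correctly; this should
# agree with the Accuracy figure reported above (~0.8372)
mean(test_pred == testing$V7)
```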
Training the Decision Tree classifier with criterion as gini index
Let's now train a decision tree classifier using the Gini index as the splitting criterion. As before, the output shows accuracy metrics for different values of cp, the complexity parameter of the tree.
> set.seed(3333)
> dtree_fit_gini <- train(V7 ~., data = training, method = "rpart",
                          parms = list(split = "gini"),
                          trControl = trctrl,
                          tuneLength = 10)
> dtree_fit_gini
CART

1212 samples
   6 predictor
   4 classes: 'acc', 'good', 'unacc', 'vgood'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1091, 1090, 1091, 1092, 1091, 1091, ...
Resampling results across tuning parameters:

  cp          Accuracy   Kappa
  0.01123596  0.8600222  0.6966316
  0.01404494  0.8493028  0.6704178
  0.01896067  0.8055473  0.5650697
  0.01966292  0.8022415  0.5587148
  0.02247191  0.7885257  0.5254510
  0.02387640  0.7874283  0.5242579
  0.05337079  0.7780797  0.5286806
  0.06179775  0.7739632  0.5354177
  0.07584270  0.7467212  0.3945498
  0.08426966  0.7202717  0.1922830

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.01123596.
Plot Decision Tree
Again, we can visualize the decision tree using the prp() method.
> prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
Prediction
Now our model is trained with cp = 0.01123596, and we are ready to predict the target variable for the whole test set.
> test_pred_gini <- predict(dtree_fit_gini, newdata = testing)
> confusionMatrix(test_pred_gini, testing$V7)  # check accuracy
Confusion Matrix and Statistics

          Reference
Prediction acc good unacc vgood
     acc   109   16    34     6
     good    5    7     0     0
     unacc   7    0   320     0
     vgood   3    1     0     8

Overall Statistics

               Accuracy : 0.8605
                 95% CI : (0.8275, 0.8892)
    No Information Rate : 0.686
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7133
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: acc Class: good Class: unacc Class: vgood
Sensitivity              0.8790     0.29167       0.9040      0.57143
Specificity              0.8571     0.98984       0.9568      0.99203
Pos Pred Value           0.6606     0.58333       0.9786      0.66667
Neg Pred Value           0.9573     0.96627       0.8201      0.98810
Prevalence               0.2403     0.04651       0.6860      0.02713
Detection Rate           0.2112     0.01357       0.6202      0.01550
Detection Prevalence     0.3198     0.02326       0.6337      0.02326
Balanced Accuracy        0.8681     0.64075       0.9304      0.78173
The above results show that the classifier with the Gini index as the splitting criterion gives 86.05% accuracy on the test set. In this case, the Gini-index classifier gives the better results.
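Since both models were trained with the same seed and the same trainControl() object, caret's resamples() function can compare them fold by fold rather than on a single test split. A short sketch:

```r
# Collect the two fitted models and summarize their
# cross-validation Accuracy and Kappa side by side
models <- resamples(list(information = dtree_fit,
                         gini        = dtree_fit_gini))
summary(models)
```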
I hope you like this post. If you have any questions, feel free to comment below. If you want me to write on a particular topic, tell me in the comments as well.