# Building Random Forest Classifier with Python Scikit learn

In the introductory article on the random forest algorithm, we covered how the algorithm works, with real-life examples. As a continuation, in this article we are going to build a random forest classifier in Python with the help of scikit-learn, one of the best Python machine learning libraries.

To build the random forest classifier we are going to use the breast cancer dataset. In short, we will build a random forest classifier that predicts the breast cancer type (benign or malignant).

• Overview of Random forest algorithm
• Benign
• Malignant
• UCI breast cancer dataset description
• Machine learning workflow
• Implementing random forest algorithm in Python
• Creating dataset
• Handling missing values
• Split data into train and test dataset
• Training random forest classifier with scikit learn
• Perform predictions
• Accuracy calculations
• Train Accuracy
• Test Accuracy
• Confusion matrix
• Summary
• Recommended Data Science Courses

### Overview of Random forest algorithm

The random forest algorithm is an ensemble classification algorithm. An ensemble classifier is a group of classifiers: instead of using only one classifier to predict the target, an ensemble uses multiple classifiers.

In the case of random forest, the ensemble members are randomly created decision trees. Each decision tree is a single classifier, and the final prediction is decided by majority voting.

Majority voting works the same way as a political election. Each person votes for one party out of all the parties taking part. In the same way, every classifier votes for one target class out of all the target classes.

To declare the election result, the votes are counted and the party with the most votes wins. In the same way, the target class with the most votes becomes the final predicted class.
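The voting idea can be sketched in a few lines of plain Python. This is only an illustration of majority voting, not how scikit-learn combines its trees internally; the function name `majority_vote` and the list of votes are made up for the example, and ties are not handled specially.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes."""
    # most_common(1) returns a one-element list holding the (label, count) pair
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes cast by five decision trees for one observation
tree_votes = ["benign", "malignant", "benign", "benign", "malignant"]
print(majority_vote(tree_votes))
```

Three of the five hypothetical trees vote for "benign", so that class is the one returned.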

Before we go further, it’s worth spending some time on the introductory random forest articles to understand how the algorithm works.

I hope you have a clear understanding of how the random forest algorithm works. Now let’s implement it. As mentioned earlier, we are going to use the breast cancer dataset to implement the random forest.

Before we begin, let’s look at some statistics on the impact of breast cancer.

Sadly, breast cancer is the second leading cause of cancer death among women. In the US, almost 246,660 cases of breast cancer in women were diagnosed in 2016. Many people believe that every tumor is cancer, but that is not true.

Only tumors that keep growing and spreading are life-threatening. Based on this property, tumors are mainly of two kinds.

• Benign Tumor
• Malignant Tumor

#### Benign:

A benign tumor is not a cancerous tumor, which means it cannot spread through the body the way cancerous tumors do. A benign tumor is serious when it grows in a sensitive place. Tumors of this kind can usually be dealt with through proper treatment and changes in diet.

#### Malignant

A malignant tumor is a cancerous tumor, the kind that can cause death. These tumors can grow quickly and spread to other parts of the body.

### UCI breast cancer dataset description

We are using the UCI breast cancer dataset to build the random forest classifier in Python. You can download the data from UCI, or download the code from the Dataaspirant GitHub.

This breast cancer dataset is one of the most popular classification datasets. It has 10 features and 1 target class.

#### Breast Cancer dataset features:

• Sample code number: id number
• Clump Thickness: values are in the range 1 – 10
• Uniformity of Cell Size: values are in the range 1 – 10
• Uniformity of Cell Shape: values are in the range 1 – 10
• Marginal Adhesion: values are in the range 1 – 10
• Single Epithelial Cell Size: values are in the range 1 – 10
• Bare Nuclei: values are in the range 1 – 10
• Bland Chromatin: values are in the range 1 – 10
• Normal Nucleoli: values are in the range 1 – 10
• Mitoses: values are in the range 1 – 10
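Because every feature except the id is documented as lying between 1 and 10, a simple sanity check on a row is possible. The sketch below is an illustration only: the helper name `out_of_range_values` and the sample row are made up, and the column names follow the header names used later in this article.

```python
def out_of_range_values(row, low=1, high=10):
    """Return the (feature, value) pairs in `row` that fall outside [low, high]."""
    return [(name, value) for name, value in row.items() if not low <= value <= high]


# Made-up row; the value 11 is deliberately invalid
sample = {"ClumpThickness": 5, "UniformityCellSize": 1, "Mitoses": 11}
print(out_of_range_values(sample))
```

An empty list means every value in the row respects the documented range.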

#### Breast Cancer dataset Target:

The target has two classes:

• Benign: the value will be 2
• Malignant: the value will be 4

This dataset also has missing values. In the coding section of this article, we are going to deal with the missing values before we train the random forest classifier.

### Machine learning workflow

### Implementing random forest algorithm in Python

To implement the random forest algorithm we are going to follow the two-phase, step-by-step workflow below.

• Build Phase
• Creating dataset
• Handling missing values
• Splitting data into train and test datasets
• Training random forest classifier with Python scikit learn
• Operational Phase
• Perform predictions
• Accuracy calculations
• Train Accuracy
• Test Accuracy

Let’s begin building the random forest classifier by importing the required Python machine learning packages.

#### Import required Python machine learning packages

```
# Required Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
```

We are going to use the above Python machine learning packages to build the random forest classifier. Let’s talk about why each of them is needed.

• Pandas:
• The pandas package is a good choice for tabular data analysis.
• train_test_split:
• We imported the scikit-learn train_test_split function to split the breast cancer dataset into train and test datasets.
• The train dataset is used in the training phase and the test dataset in the validation phase.
• RandomForestClassifier:
• We imported the scikit-learn RandomForestClassifier class to fit a random forest classifier on the training dataset.
• The trained classifier is later used to perform the predictions.
• accuracy_score:
• We imported the scikit-learn accuracy_score function to calculate the accuracy of the trained classifier.
• confusion_matrix:
• We imported the scikit-learn confusion_matrix function to understand how the trained classifier behaves on the test (validation) dataset.

Copy the above code into any text file (or your favorite text editor) and save the file with the Python extension (.py), say random_forest.py.

Then run the random_forest.py file from the terminal using the command below.

`python random_forest.py`

If you installed the Python machine learning packages properly, you won’t face any issues. If the packages are installed and you still see the error ImportError: No module named model_selection, the scikit-learn package you are using has not been updated to a recent enough version.

You are probably using scikit-learn 0.17 or older. You can copy and paste the code below to check your scikit-learn version.

```
import sklearn
print(sklearn.__version__)
```

If the version you are using is 0.17 or older, you need to update scikit-learn to 0.18.
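The same comparison can be done in code rather than by eye. The sketch below is a minimal illustration that assumes a plain numeric "major.minor" prefix in the version string; the helper name `needs_upgrade` is made up, and pre-release strings such as "1.0rc1" are not handled.

```python
def needs_upgrade(version, minimum=(0, 18)):
    """Return True if a 'major.minor[.patch]' version string is older than `minimum`."""
    major, minor = version.split(".")[:2]
    # Tuples compare element by element, so (0, 17) < (0, 18) is True
    return (int(major), int(minor)) < minimum


print(needs_upgrade("0.17.1"))
print(needs_upgrade("0.18"))
```

In a real script the argument would be `sklearn.__version__` instead of a hard-coded string.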

You can use the below commands to update your scikit learn to the new version (0.18)

Using Pip

`pip install -U scikit-learn`

Using Anaconda

`conda install scikit-learn=0.18`

Once you have upgraded your scikit-learn package, run the above code again and you won’t face any issues. If you still have trouble running the above code, please let me know in the comments section.

Now let’s create the dataset to model the random forest classifier.

#### Creating dataset

The downloaded dataset is in the .data format, so we are going to convert it into CSV format. To do that we will write a simple function that first loads the .data file into a pandas dataframe and then saves the loaded dataframe as a CSV file.

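```
# File Paths
INPUT_PATH = "../inputs/breast-cancer-wisconsin.data"
OUTPUT_PATH = "../inputs/breast-cancer-wisconsin.csv"


def data_file_to_csv():
    """
    Convert the downloaded .data file into a CSV file with header names
    :return:
    """
    # Headers
    headers = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
               "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses",
               "CancerType"]
    # Load the dataset into Pandas data frame
    # Note: the .data file has no header row, so read_csv consumes its first
    # data row as the header (the statistics shown later count 698 rows)
    dataset = pd.read_csv(INPUT_PATH)
    # Add the headers to the loaded dataset
    dataset = add_headers(dataset, headers)
    # Save the loaded dataset into csv format
    dataset.to_csv(OUTPUT_PATH, index=False)
    print("File saved ...!")
```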

INPUT_PATH holds the path of the downloaded .data file, and OUTPUT_PATH holds the path where the CSV file will be saved.

Using the pandas read_csv method, we load the .data file into a pandas dataframe.

The loaded dataset doesn’t have header names, so we need to add them to the loaded dataframe. To do this we have written a function which takes the dataset and the header names as input and adds the header names to the dataset.

```
def add_headers(dataset, headers):
    """
    Add the headers to the dataset
    :param dataset:
    :param headers:
    :return:
    """
    dataset.columns = headers
    return dataset
```

After adding the header names, we save the dataset in CSV format. While saving the file we pass index=False. Without it, the saved file would have an extra column holding the row indexes, so we pass index=False to leave that column out.
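To picture the file this produces (a header row followed by the data rows, with no extra index column), here is a small sketch using only Python's standard csv module instead of pandas. It is an illustration under stated assumptions: the two data rows are made up and only the first three header names are used.

```python
import csv
import io

# First three header names only, to keep the example short
headers = ["CodeNumber", "ClumpThickness", "UniformityCellSize"]

# Stand-in for a header-less .data file (made-up rows)
raw = io.StringIO("1000,5,1\n1001,3,7\n")

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(headers)            # the header row goes first
writer.writerows(csv.reader(raw))   # then every data row, unchanged
csv_text = out.getvalue()
print(csv_text)
```

Each output line has exactly as many fields as there are headers, which is the layout `to_csv(..., index=False)` gives for a dataframe.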

Now we are ready with the dataset. The next big step is preprocessing the data.

Sometimes, if it’s our day, we don’t need to do much work in the preprocessing stage. If not, we need to spend a lot of time on it.

Handling missing values is one such preprocessing task.

#### Handling missing values

The process of handling missing values differs from dataset to dataset. For the cancer dataset, a few simple steps are enough to handle the missing values in the loaded dataset.

Before revealing what the missing values are, I want to show a way to identify them, so that when you work on a different dataset you can find the missing values yourself.

A good starting point is calculating basic statistics for each column (features and target) of the dataset. You may be wondering what the use of basic statistics is and how they help to find the missing values.

Basic statistics do help here. The idea is to call the pandas describe method on the loaded dataset. By default, describe reports statistics only for numeric columns, so a column that holds non-numeric values (such as a placeholder character for missing values, or categories) is left out of the output.

If that sounds confusing, no problem: let’s implement a function to calculate the basic statistics, and it will become clear.

```
def dataset_statistics(dataset):
    """
    Basic statistics of the dataset
    :param dataset: Pandas dataframe
    :return: None, print the basic statistics of the dataset
    """
    print(dataset.describe())
```

As I said before, we are using the pandas describe method to get the basic statistics of the dataset.

Now let’s call this function and check what it outputs.

```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)


if __name__ == "__main__":
    main()
```

Script Output

```
         CodeNumber  ClumpThickness  UniformityCellSize  UniformityCellShape
count  6.980000e+02      698.000000          698.000000           698.000000
mean   1.071807e+06        4.416905            3.137536             3.210602
std    6.175323e+05        2.817673            3.052575             2.972867
min    6.163400e+04        1.000000            1.000000             1.000000
25%    8.702582e+05        2.000000            1.000000             1.000000
50%    1.171710e+06        4.000000            1.000000             1.000000
75%    1.238354e+06        6.000000            5.000000             5.000000
max    1.345435e+07       10.000000           10.000000            10.000000

       MarginalAdhesion  SingleEpithelialCellSize  BlandChromatin
count        698.000000                698.000000      698.000000
mean           2.809456                  3.217765        3.438395
std            2.856606                  2.215408        2.440056
min            1.000000                  1.000000        1.000000
25%            1.000000                  2.000000        2.000000
50%            1.000000                  2.000000        3.000000
75%            4.000000                  4.000000        5.000000
max           10.000000                 10.000000       10.000000

       NormalNucleoli     Mitoses  CancerType
count      698.000000  698.000000  698.000000
mean         2.869628    1.590258    2.690544
std          3.055004    1.716162    0.951596
min          1.000000    1.000000    2.000000
25%          1.000000    1.000000    2.000000
50%          1.000000    1.000000    2.000000
75%          4.000000    1.000000    4.000000
max         10.000000   10.000000    4.000000
```

If you look closely at the above statistics, you can see that one column is missing from the output. The missing column header is BareNuclei.

Now open the CSV file and check the values in the BareNuclei column. You will find the missing values written as ?, which means we need to treat those values as missing values.
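Instead of opening the file by hand, the same check can be scripted. The following plain-Python sketch scans rows for values that cannot be parsed as integers and reports them per column. It is only an illustration: the helper name `non_numeric_values` and the three rows are made up, and only three of the header names are used.

```python
def non_numeric_values(rows, headers):
    """Map each column name to the set of values that cannot be parsed as integers."""
    problems = {}
    for row in rows:
        for name, value in zip(headers, row):
            try:
                int(value)
            except ValueError:
                # Remember the offending token under its column name
                problems.setdefault(name, set()).add(value)
    return problems


headers = ["ClumpThickness", "BareNuclei", "Mitoses"]
rows = [["5", "1", "1"], ["8", "?", "2"], ["3", "10", "1"]]  # made-up rows
print(non_numeric_values(rows, headers))
```

Only the column containing the "?" placeholder shows up in the result, which is the same signal the describe output gives by leaving that column out.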

Now that we know which column has missing values and which character represents them, let’s write a simple function that takes the dataset, the header name, and the missing-value character as input and filters out the rows with missing values.

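```
def handel_missing_values(dataset, missing_values_header, missing_label):
    """
    Filter missing values from the dataset
    :param dataset:
    :param missing_values_header:
    :param missing_label:
    :return:
    """
    # Keep only the rows whose value in the given column is not the missing label
    return dataset[dataset[missing_values_header] != missing_label]
```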

Now let’s call this function inside the main function

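```
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses",
           "CancerType"]


def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values (HEADERS[6] is "BareNuclei")
    dataset = handel_missing_values(dataset, HEADERS[6], '?')


if __name__ == "__main__":
    main()
```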
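To see what this kind of filtering does on a tiny example, here is a plain-Python sketch that works on a list of dictionaries rather than a pandas dataframe. The helper name `drop_rows_with_label` and the three rows are made up for illustration.

```python
def drop_rows_with_label(rows, column, missing_label="?"):
    """Keep only the rows whose value in `column` is not the missing label."""
    return [row for row in rows if row[column] != missing_label]


# Made-up rows; the second one has a missing BareNuclei value
rows = [
    {"BareNuclei": "1", "CancerType": 2},
    {"BareNuclei": "?", "CancerType": 4},
    {"BareNuclei": "10", "CancerType": 4},
]
kept = drop_rows_with_label(rows, "BareNuclei")
print(len(rows), "->", len(kept))
```

Rows with the placeholder are dropped entirely rather than repaired, so the dataset gets slightly smaller.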

Now our dataset is free of missing values. Next, let’s split the data into train and test datasets. The training dataset will be used to train the random forest classifier, and the test dataset will be used to validate the trained classifier.

#### Split data into train and test datasets

To split the data into train and test datasets, let’s write a function which takes the dataset, train percentage, feature header names and target header name as inputs and returns train_x, test_x, train_y and test_y as outputs.

```
def split_dataset(dataset, train_percentage, feature_headers, target_header):
    """
    Split the dataset with train_percentage
    :param dataset:
    :param train_percentage:
    :param feature_headers:
    :param target_header:
    :return: train_x, test_x, train_y, test_y
    """

    # Split dataset into train and test dataset
    train_x, test_x, train_y, test_y = train_test_split(dataset[feature_headers], dataset[target_header],
                                                        train_size=train_percentage)
    return train_x, test_x, train_y, test_y
```

Now let’s call the above function inside the main function.

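```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    # Features are every column except the id and the target; 0.7 is the train fraction
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])


if __name__ == "__main__":
    main()
```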

We can print the shapes of train_x, test_x, train_y and test_y to check whether the split is correct.

```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Train and Test dataset size details
    print("Train_x Shape :: ", train_x.shape)
    print("Train_y Shape :: ", train_y.shape)
    print("Test_x Shape :: ", test_x.shape)
    print("Test_y Shape :: ", test_y.shape)


if __name__ == "__main__":
    main()
```

Script output:

```
Train_x Shape ::  (477, 9)
Train_y Shape ::  (477,)
Test_x Shape ::  (205, 9)
Test_y Shape ::  (205,)
```

From the above result, it’s clear that the split worked as expected. Now let’s build the random forest classifier using the train_x and train_y datasets.
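As a quick check of the shapes above: 477 + 205 = 682 rows were split. The sketch below reproduces those sizes under two assumptions that are not stated in the output itself: a train fraction of 0.7, and the train portion being rounded down with the remainder going to the test set. The helper name `expected_split_sizes` is made up.

```python
import math


def expected_split_sizes(n_rows, train_fraction):
    """Return (train rows, test rows), rounding the train portion down."""
    n_train = math.floor(train_fraction * n_rows)
    return n_train, n_rows - n_train


print(expected_split_sizes(682, 0.7))
```

Both assumptions are consistent with the printed shapes: 0.7 × 682 = 477.4, which rounds down to 477 and leaves 205 rows for testing.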

#### Training random forest classifier with scikit learn

To train the random forest classifier we are going to use the random_forest_classifier function below, which takes the features (train_x) and target (train_y) data as inputs and returns the trained random forest classifier as output.

```
def random_forest_classifier(features, target):
    """
    To train the random forest classifier with features and target data
    :param features:
    :param target:
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf
```

Now let’s call the above function inside the main function and print the trained classifier.

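```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)


if __name__ == "__main__":
    main()
```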

Script Output:

```
Trained model ::  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
```
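The printed parameters include bootstrap=True and n_estimators=10: the forest holds 10 trees, and each one is trained on a bootstrap sample (rows drawn with replacement) of the training data. The sketch below is a heavily simplified toy version of that idea, not scikit-learn's implementation. It assumes a single made-up feature, uses a one-split "stump" in place of a full decision tree, and skips random feature selection; all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a one-split 'tree' on (value, label) pairs: pick the threshold with the fewest errors."""
    values = sorted({value for value, _ in sample})
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    # (errors, threshold, label at or below threshold, label above threshold)
    best = (len(sample) + 1, None, majority, majority)
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = Counter(label for value, label in sample if value <= threshold)
        right = Counter(label for value, label in sample if value > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        errors = len(sample) - left[left_label] - right[right_label]
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, value):
    threshold, left_label, right_label = stump
    if threshold is None:  # the sample held a single distinct value
        return left_label
    return left_label if value <= threshold else right_label


def fit_forest(data, n_trees=10, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(data) rows with replacement
        sample = [rng.choice(data) for _ in data]
        forest.append(fit_stump(sample))
    return forest


def predict_forest(forest, value):
    # Every stump votes; the most common vote is the prediction
    votes = Counter(predict_stump(stump, value) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up data: small values are benign (2), large values are malignant (4)
data = [(1, 2), (2, 2), (3, 2), (8, 4), (9, 4), (10, 4)]
forest = fit_forest(data)
print(predict_forest(forest, 2), predict_forest(forest, 9))
```

Because each stump sees a different bootstrap sample, the individual thresholds differ, and the majority vote smooths out the occasional stump trained on an unlucky sample.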

### Perform predictions

Now that the classifier is trained, let’s take a few observations from the test dataset and print what our model predicts next to the actual target.

To do that, let’s first predict the target for all the test features (test_x) using the trained classifier. Then we will compare what our trained model predicts with the actual outcome.

```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    for i in range(0, 5):
        print("Actual outcome :: {} and Predicted outcome :: {}".format(list(test_y)[i], predictions[i]))


if __name__ == "__main__":
    main()
```

First, test_y is converted from a pandas Series into a list. Because the train and test datasets were split randomly, the index labels of test_y are not in order and do not start at 0. Converting it to a list lets us access the values by position.

The above code prints the first 5 values of test_y next to the predicted targets.

```
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 4 and Predicted outcome :: 4
```

Seems like the trained classifier predicted the first 5 target classes correctly. To know more about the model let’s check the train and test accuracy information.

### Accuracy calculations

```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    # Train and Test Accuracy
    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))


if __name__ == "__main__":
    main()
```

To calculate the accuracy we are using the scikit-learn accuracy_score function.

Script output:

```
Train Accuracy ::  0.991614255765
Test Accuracy  ::  0.970731707317
```

Our trained classifier gives 99% accuracy on the train dataset and 97% accuracy on the test dataset.
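Accuracy is simply the fraction of predictions that match the actual targets. Here is a minimal plain-Python sketch of that calculation; it is not scikit-learn's implementation, and the two lists are made up for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the prediction equals the actual target."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


actual = [2, 2, 2, 2, 4]      # made-up targets
predicted = [2, 2, 4, 2, 4]   # made-up predictions with one mistake
print(accuracy(actual, predicted))
```

Four of the five made-up predictions are correct, so the sketch reports an accuracy of 0.8.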

### Confusion matrix

To see the true positive and true negative counts, let’s print the confusion matrix of our trained classifier.

```
def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    predictions = trained_model.predict(test_x)

    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    print(" Confusion matrix ", confusion_matrix(test_y, predictions))


if __name__ == "__main__":
    main()
```

Script output:

```
Train Accuracy ::  0.991614255765
Test Accuracy  ::  0.970731707317
Confusion matrix  [[123   5]
 [  1  76]]
```
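The confusion matrix and the test accuracy tell the same story. In scikit-learn's confusion_matrix, rows are the actual classes and columns are the predicted classes, in sorted label order (2 = benign first, then 4 = malignant). The short sketch below recomputes the accuracy from the four printed counts; the variable names are made up for illustration.

```python
# Counts copied from the script output above
matrix = [[123, 5],
          [1, 76]]

correct = matrix[0][0] + matrix[1][1]             # diagonal = correct predictions
total = sum(sum(row) for row in matrix)           # all test observations
accuracy = correct / total

benign_recall = matrix[0][0] / sum(matrix[0])     # benign cases predicted as benign
malignant_recall = matrix[1][1] / sum(matrix[1])  # malignant cases predicted as malignant

print(total, round(accuracy, 6))
print(round(benign_recall, 4), round(malignant_recall, 4))
```

The total of 205 matches the size of the test set, and 199 correct predictions out of 205 give the same test accuracy as accuracy_score. The matrix also shows that 5 benign cases were predicted as malignant and 1 malignant case was predicted as benign.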

You can get the complete code below. If you know GitHub, you can clone the complete code from our Github account.

#### Complete Code:

```
# Required Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# File Paths
INPUT_PATH = "../inputs/breast-cancer-wisconsin.data"
OUTPUT_PATH = "../inputs/breast-cancer-wisconsin.csv"

# Headers
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]


def read_data(path):
    """
    Read the data into pandas dataframe
    :param path:
    :return:
    """
    # Note: the .data file has no header row, so read_csv consumes its first
    # data row as the header (the statistics shown earlier count 698 rows)
    data = pd.read_csv(path)
    return data


def get_headers(dataset):
    """
    Dataset headers
    :param dataset:
    :return:
    """
    return dataset.columns.values


def add_headers(dataset, headers):
    """
    Add the headers to the dataset
    :param dataset:
    :param headers:
    :return:
    """
    dataset.columns = headers
    return dataset


def data_file_to_csv():
    """
    Convert the downloaded .data file into a CSV file with header names
    :return:
    """
    # Load the dataset into Pandas data frame
    dataset = read_data(INPUT_PATH)
    # Add the headers to the loaded dataset
    dataset = add_headers(dataset, HEADERS)
    # Save the loaded dataset into csv format
    dataset.to_csv(OUTPUT_PATH, index=False)
    print("File saved ...!")


def split_dataset(dataset, train_percentage, feature_headers, target_header):
    """
    Split the dataset with train_percentage
    :param dataset:
    :param train_percentage:
    :param feature_headers:
    :param target_header:
    :return: train_x, test_x, train_y, test_y
    """

    # Split dataset into train and test dataset
    train_x, test_x, train_y, test_y = train_test_split(dataset[feature_headers], dataset[target_header],
                                                        train_size=train_percentage)
    return train_x, test_x, train_y, test_y


def handel_missing_values(dataset, missing_values_header, missing_label):
    """
    Filter missing values from the dataset
    :param dataset:
    :param missing_values_header:
    :param missing_label:
    :return:
    """
    # Keep only the rows whose value in the given column is not the missing label
    return dataset[dataset[missing_values_header] != missing_label]


def random_forest_classifier(features, target):
    """
    To train the random forest classifier with features and target data
    :param features:
    :param target:
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf


def dataset_statistics(dataset):
    """
    Basic statistics of the dataset
    :param dataset: Pandas dataframe
    :return: None, print the basic statistics of the dataset
    """
    print(dataset.describe())


def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values (HEADERS[6] is "BareNuclei")
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Train and Test dataset size details
    print("Train_x Shape :: ", train_x.shape)
    print("Train_y Shape :: ", train_y.shape)
    print("Test_x Shape :: ", test_x.shape)
    print("Test_y Shape :: ", test_y.shape)

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    for i in range(0, 5):
        print("Actual outcome :: {} and Predicted outcome :: {}".format(list(test_y)[i], predictions[i]))

    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    print(" Confusion matrix ", confusion_matrix(test_y, predictions))


if __name__ == "__main__":
    main()
```

### Summary

In this article, you learned how to implement one of the most popular classification algorithms, random forest, in Python using the scikit-learn package.

Along the way, you learned how to handle missing values. Finally, you learned how to calculate the accuracy of a trained classifier using the scikit-learn accuracy_score function.

I hope you like this post. If you have any questions, then feel free to comment below.  If you want me to write on one particular topic, then do tell it to me in the comments below.

### 30 Responses to “Building Random Forest Classifier with Python Scikit learn”

• Sagar Kalra
4 years ago

You are my love bro :*

• 3 years ago

Hi Sagar Kalra,

Thanks for the compliment.

We wish you a very happy learning.

• Manjunath
4 years ago

KEY ERROR :BareNuclei

• 2 years ago

Hi Manjunath,

The key you need to consider is BareNuclei not :BareNuclei. Could you please try this key and let me know if you still face the issue.

• Manjunath
4 years ago

File b’../inputs/breast-cancer-wisconsin.csv’ does not exist

• 2 years ago

Hi Manjunath,

Thanks and happy learning!

• Manjunath
4 years ago

• 2 years ago

Hi Manjunath,

When you copy the code in the article, Please check the indentation is properly followed in the code editor you are using, You can compare the code in the article and in your editor. I hope this will resolve issue. If not let me know.

Thanks and happy learning!

• 4 years ago

If you wold have added the data set (the csv file) it would have been great.

• 2 years ago

Hi Sasha,

Thanks and happy learning!

• Xiaoping
5 years ago

Can I specify max-number of sub-tree in the random forest modelling method?

• 2 years ago

Hi Xiaoping,

Instead of manually trying different values, you can use grid parameter search to find a good maximum number of sub-trees for the random forest (the n_estimators parameter in scikit-learn).

Thanks and happy learning.

• lol
5 years ago

Great article, thanks!

• 5 years ago

Hi Lol,

• Shubhangi
5 years ago

Hello,

• 5 years ago

Hi Shubhangi,

I suspect you haven’t installed the Python machine learning packages properly. Please follow the article on how to create the machine learning Python environment.

• Sakthi
5 years ago

How exactly does the handle missing values function work in the above code? I’m new to pandas

• 5 years ago

Hi Sakthi,

The function in the article which handles the missing values is pretty simple one. From the data itself, we know that having “?” represents the missing observation. So in the handling missing values function, we are just checking if the observation is having “?” as value then we are not considering those observations.

• Tanveer Ahmed
5 years ago

first, thank you for such a detailed explanation on Machine Learning.
But I am facing error while running the random forest example:
###
Traceback (most recent call last):
File “”, line 2, in
main()
File “”, line 11, in main
NameError: global name ‘HEADERS’ is not defined
####
This error is observed when the below code is executed:
###
def main():
“””
Main function
:return:
“””
# Load the csv file into pandas dataframe
# Get basic statistics of the loaded dataset
dataset_statistics(dataset)
# Filter missing values
###

• 5 years ago

Hi Tanveer,

We missed adding the HEADERS list in the code, which raised the error.
The article has been updated; have a look.

• Shon
5 years ago

Thanks for your clear and complete article!
If it is possible, plotting of the results could be helpful.

• 5 years ago

Hi Shon,
Thanks a lot for your compliment 🙂

Will try to publish an article on how to visualize the trained random forest classifier.

• Shon
5 years ago

I strongly recommend this article who want to know how to write the lines of RF and begin ML. And thank you as one of them.

• 5 years ago

Hi Shon,

Thanks a lot for your compliment 🙂 .

• rebeen
5 years ago

thank you that is really nice

• 5 years ago

Hi Rebeen,

• Bharat Singh
5 years ago

Very Nice explanation!
Real word example is very interesting. It’s give depth knowledge about the RF working.

• 5 years ago

Hi Bharat Singh,