Building Random Forest Classifier with Python Scikit learn

Building Random Forest Algorithm in Python

In the introductory article about the random forest algorithm, we covered how the algorithm works using real-life examples. As a continuation, in this article we are going to build a random forest classifier in Python with the help of Scikit-Learn, one of the best Python machine learning libraries.

To build the random forest classifier we are going to use the Breast Cancer dataset. To summarize, in this article we are going to build a random forest classifier to predict the breast cancer type (benign or malignant).

Overview of Random forest algorithm

The random forest algorithm is an ensemble classification algorithm. An ensemble classifier is a group of classifiers: instead of using only one classifier to predict the target, we use multiple classifiers.

In the case of random forest, the members of the ensemble are randomly created decision trees. Each decision tree is a single classifier, and the target prediction is based on majority voting.

Majority voting works the same way as a political election. Each person votes for one party out of all the parties participating in the election. In the same way, every classifier votes for one target class out of all the target classes.

To declare the election result, the votes are counted and the party with the most votes is declared the winner. In the same way, the target class with the most votes is taken as the final predicted class.
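
The voting idea can be sketched in a few lines of Python. This is only an illustration of hard majority voting: the function name majority_vote and the five votes are made up for the example, and scikit-learn's own random forest combines its trees by averaging their predicted class probabilities rather than by counting votes like this.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical votes from five decision trees for one observation
tree_votes = ["benign", "malignant", "benign", "benign", "malignant"]
print(majority_vote(tree_votes))  # prints benign (3 votes against 2)
```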

Before we go further, it is worth spending some time on the earlier articles to understand how the random forest algorithm works.

I hope you now have a clear understanding of how the random forest algorithm works, so let’s implement it. As I said earlier, we are going to use the breast cancer dataset to implement the random forest.

Before we begin, let’s look at some statistics on breast cancer and its impact today.

About Breast Cancer

Sadly, breast cancer is the second leading cause of cancer death among women. In the US, about 246,660 cases of breast cancer in women were diagnosed in 2016. A common myth is that every tumor is cancer, which is not true.

Only a continuously growing tumor causes death. Based on this property, tumors are mainly of two kinds.

  • Benign Tumor
  • Malignant Tumor

Figure: the difference between malignant and benign tumors. Image credit: thetruthaboutcancer.com

Benign:

A benign tumor is not cancerous, which means it cannot spread through the body the way a cancerous tumor can. A benign tumor becomes serious when it grows in a sensitive place. Tumors of this kind can usually be dealt with through proper treatment and a change in diet habits.

Malignant:

A malignant tumor is a cancerous tumor, and it can cause death. These tumors can grow quickly and spread to various parts of the body.

A good read about these tumors and health prevention can be found in the thetruthaboutcancer article.

UCI breast cancer dataset description

We are using the UCI breast cancer dataset to build the random forest classifier in Python. You can download the data from UCI, or you can download the code from the Dataaspirant GitHub.

This breast cancer dataset is one of the most popular classification datasets. It has 10 attributes (a sample id and nine features) and 1 target class.

Breast Cancer dataset features:

  • Sample code number:
    • id number
  • Clump Thickness:
    • The values are in the range of 1 – 10
  • Uniformity of Cell Size:
    • The values are in the range of 1 – 10
  • Uniformity of Cell Shape:
    • The values are in the range of  1 – 10
  • Marginal Adhesion:
    • The values are in the range of 1 – 10
  • Single Epithelial Cell Size:
    • The values are in the range of  1 – 10
  • Bare Nuclei:
    • The values are in the range of  1 – 10
  • Bland Chromatin:
    • The values are in the range of  1 – 10
  • Normal Nucleoli:
    • The values are in the range of  1 – 10
  • Mitoses:
    • The values are in the range of  1 – 10

Breast Cancer dataset Target:

The target has two classes:

  • Benign
    • The value will be 2
  • Malignant
    • The value will be 4

This dataset also has missing values. In the coding section of this article, we are going to deal with the missing values before we model the random forest classifier.

Machine learning workflow

Figure: machine learning workflow

Implementing random forest algorithm in Python

To implement the random forest algorithm we are going to follow the two-phase, step-by-step workflow below.

  • Build Phase
    • Creating dataset
    • Handling missing values
    • Splitting data into train and test datasets
    • Training random forest classifier with Python scikit learn
  • Operational Phase
    • Perform predictions
    • Accuracy calculations
      • Train Accuracy
      • Test Accuracy

Let’s begin the journey of building the random forest classifier by importing the required Python machine learning packages.

Import required Python machine learning packages

These are the Python machine learning packages we are going to use to build the random forest classifier. Let’s talk about why each of them is needed.

  • Pandas:
    • The Pandas package is a good choice for tabular data analysis.
    • All the data manipulation tasks in this article use Pandas methods.
  • train_test_split:
    • We import the scikit-learn train_test_split function to split the breast cancer dataset into train and test datasets.
    • The train dataset is used in the training phase and the test dataset is used in the validation phase.
  • RandomForestClassifier:
    • We import the scikit-learn RandomForestClassifier class to model the training dataset with a random forest classifier.
    • Later, the trained random forest classifier is used to perform the predictions.
  • accuracy_score:
    • We import the scikit-learn accuracy_score function to calculate the accuracy of the trained classifier.
  • confusion_matrix:
    • We import the scikit-learn confusion_matrix function to understand how the trained classifier behaves on the test (validation) dataset.
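
The imports described in the list above would look roughly like this. It is a sketch that assumes scikit-learn 0.18 or later, where train_test_split lives in sklearn.model_selection.

```python
# Tabular data handling
import pandas as pd

# Splitting, modeling and evaluation helpers from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
```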

Copy the above code into a text file (using your favorite text editor) and save the file with the Python extension (.py), say random_forest.py.

Then run the random_forest.py file from the terminal with the command python random_forest.py.

If you installed the Python machine learning packages properly, you won’t face any issues. If you installed the packages properly and still get the error ImportError: No module named model_selection, the scikit-learn package you are using has not been updated to a recent version.

You are most likely using scikit-learn 0.17 or an earlier version. You can check your scikit-learn version by printing sklearn.__version__.
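
A minimal way to check the installed version from Python is shown here. The printed value depends on your installation.

```python
import sklearn

# The version string of the installed scikit-learn package
print("scikit-learn version:", sklearn.__version__)
```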

If the version you are using is 0.17 or earlier, you need to update scikit-learn to 0.18 or later.

You can use the commands below to update scikit-learn to a newer version.

Using Pip: pip install -U scikit-learn

Using Anaconda: conda update scikit-learn

Once you have upgraded your scikit-learn package, run the above code again and you won’t face any issues. If you still face any issue running the code, please let me know in the comments section.

Now let’s create the dataset to model the random forest classifier.

Creating dataset

The downloaded dataset is in the .data format, so we are going to convert it into CSV format. To do that we are going to write a simple function which first loads the .data file into a pandas dataframe and then saves the loaded dataframe in CSV format.

INPUT_PATH holds the path of the downloaded .data file and OUTPUT_PATH holds the path where the CSV file is going to be saved.

Using the pandas read_csv method, we load the .data file into a pandas dataframe.

The loaded dataset doesn’t have header names, so we need to add them to the loaded dataframe. To do so, we have written a function which takes the dataset and header names as input and adds the header names to the dataset.

After adding the header names to the dataset, we save it in CSV format. While saving the file we pass index=False. Without this parameter the saved file would have an extra column holding the row indexes, so we pass index=False to leave it out.
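
The conversion described above could be sketched as follows. The names INPUT_PATH, OUTPUT_PATH, HEADERS and data_file_to_csv appear in the article and its comments; the file names, the helper name add_headers, and the exact header spellings (apart from BareNuclei) are assumptions made for this example.

```python
import pandas as pd

# Assumed file locations; change them to wherever you saved the download
INPUT_PATH = "breast-cancer-wisconsin.data"
OUTPUT_PATH = "breast-cancer-wisconsin.csv"

# Assumed header spellings, following the order of the feature list above
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize",
           "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin",
           "NormalNucleoli", "Mitoses", "CancerType"]


def add_headers(dataset, headers):
    """Attach the given header names to a dataframe loaded without headers."""
    dataset.columns = headers
    return dataset


def data_file_to_csv(input_path=INPUT_PATH, output_path=OUTPUT_PATH,
                     headers=HEADERS):
    """Load the headerless .data file, add headers and save it as CSV."""
    # header=None stops pandas from treating the first data row as headers
    dataset = pd.read_csv(input_path, header=None)
    dataset = add_headers(dataset, headers)
    # index=False keeps the row index out of the saved file
    dataset.to_csv(output_path, index=False)
    return dataset
```

Calling data_file_to_csv() once creates the CSV file that the later steps load with pd.read_csv(OUTPUT_PATH).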

Now we are ready with the dataset. The next big thing is preprocessing the data.

Sometimes, if it’s our day, we don’t need to do much work in the preprocessing stage. If not, we need to spend a lot of time on it.

Handling missing values is one such preprocessing task.

Handling missing values

The process of handling missing values differs from dataset to dataset. For the cancer dataset, we use a simple approach to handle the missing values in the loaded dataset.

Before revealing what those missing values are, I want to show how to identify them, so that when you are working on a different dataset you can identify the missing values yourself.

The best way to start is to calculate basic statistics for each column (features and target) of the dataset. You may be wondering what the use of calculating basic statistics is and how it helps to find the missing values.

Basic statistics do help us find the missing values. The idea is to use the pandas describe method on the loaded dataset. By default it reports statistics only for numeric columns, so a column that holds categorical values, or a non-numeric placeholder for missing values, is left out of the output.

Feeling lost? No issue. Let’s implement a function to calculate the basic statistics, and then you will get a clear idea of what I am talking about.

As I said before, we are using the pandas describe method to get the basic statistics of the dataset.
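
A minimal sketch of such a function is shown here. The name dataset_statistics comes from the code quoted in the comments; the three-row sample frame is made up purely to show how a text placeholder hides a column from describe.

```python
import pandas as pd


def dataset_statistics(dataset):
    """Print basic statistics for the numeric columns of the dataset."""
    print(dataset.describe())


# Made-up rows for illustration: the "?" turns BareNuclei into a text column
sample = pd.DataFrame({
    "ClumpThickness": [5, 8, 1],
    "BareNuclei": ["1", "?", "10"],
    "CancerType": [2, 4, 2],
})

# The printed table has ClumpThickness and CancerType but no BareNuclei
dataset_statistics(sample)
```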

Now let’s call this function and check what it outputs.

Script Output

 

If you look at the above statistics closely, you can see that the details of one column are missing. The missing column header is BareNuclei.

Now open the CSV file and check the values of the Bare Nuclei column. You will find the missing values represented with ?, which means we need to treat those values as missing.

Now that we know the column with missing values and the character used to represent them, let’s write a simple function which takes the dataset, the header name and the character representing missing values as input and handles the missing values.
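
One simple way to handle these values is to drop the affected rows, sketched here. The function name handel_missing_values is spelled as in the code quoted in the comments; dropping rows (rather than imputing values) and the conversion to numbers afterwards are assumptions of this sketch, not necessarily what the original listing did.

```python
import pandas as pd


def handel_missing_values(dataset, missing_values_header, missing_label):
    """Drop the rows where the given column holds the missing-value label."""
    cleaned = dataset[dataset[missing_values_header] != missing_label].copy()
    # The placeholder made pandas read the column as text, so convert it back
    cleaned[missing_values_header] = pd.to_numeric(cleaned[missing_values_header])
    return cleaned


# Made-up rows for illustration
sample = pd.DataFrame({"BareNuclei": ["1", "?", "10"], "CancerType": [2, 4, 2]})
cleaned_sample = handel_missing_values(sample, "BareNuclei", "?")
print(cleaned_sample)  # the row containing "?" is gone
```

With the HEADERS list, the call inside main would be handel_missing_values(dataset, HEADERS[6], '?').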

Now let’s call this function inside the main function.

Now our dataset is free of missing values, so let’s split the data into train and test datasets. The training dataset will be used to train the random forest classifier and the test dataset will be used to validate the trained classifier.

Split data into train and test datasets

To split the data into train and test datasets, let’s write a function which takes the dataset, train percentage, feature header names and target header name as inputs and returns train_x, test_x, train_y and test_y as outputs.
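
A sketch of such a function is shown here. The name split_dataset and the 0.7 in the commented call are assumptions; the inputs and outputs follow the description above.

```python
from sklearn.model_selection import train_test_split


def split_dataset(dataset, train_percentage, feature_headers, target_header):
    """Randomly split the dataset into train and test features and targets."""
    train_x, test_x, train_y, test_y = train_test_split(
        dataset[feature_headers],
        dataset[target_header],
        train_size=train_percentage,
    )
    return train_x, test_x, train_y, test_y


# Example call inside main, assuming a HEADERS list whose first entry is the
# id column and whose last entry is the target column:
# train_x, test_x, train_y, test_y = split_dataset(
#     dataset, 0.7, HEADERS[1:-1], HEADERS[-1])
```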

Now let’s call the above function inside the main function.

Here HEADERS[1:-1] contains all the feature header names: we leave out the first and last header names, since the first header is the id and the last header is the target. HEADERS[-1] contains the target header name.

We can print the shapes of train_x, test_x, train_y and test_y to check whether the split is proper.

Script output:

From the above result, it’s clear that the train and test split was proper. Now let’s build the random forest classifier using the train_x and train_y datasets.

Training random forest classifier with scikit learn

To train the random forest classifier we are going to use a random_forest_classifier function, which takes the features (train_x) and target (train_y) as inputs and returns the trained random forest classifier as output.
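
A minimal version of this function could look like the following. It uses the default RandomForestClassifier settings, which is an assumption; the original listing may have set other parameters.

```python
from sklearn.ensemble import RandomForestClassifier


def random_forest_classifier(features, target):
    """Train a random forest classifier and return the fitted model."""
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf
```

Inside main, the call would be trained_model = random_forest_classifier(train_x, train_y), where trained_model is a name chosen for this example.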

 

Now let’s call the above function inside the main function and print the trained classifier.

 

Script Output:

 

Perform predictions

Now that we have trained the classifier, let’s take a few observations from the test dataset and print what our model predicts and what the actual target is.

To do that, let’s first predict the target for all the test features (test_x) using the trained classifier. Then we will compare what our trained model predicts with the actual output.

First I converted test_y from a pandas object into a list. The reason is that, as we randomly split the train and test datasets, the indexes of test_y won’t be in order. If we convert it into a list, the indexes will be in order.

We then print the first 5 values of test_y alongside the predicted targets.
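
The comparison can be sketched as below. The helper name compare_predictions, the toy rows and the fixed random_state are all made up for this example; in the article's flow you would pass the trained classifier together with test_x and test_y.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def compare_predictions(trained_model, test_x, test_y, count=5):
    """Print and return (actual, predicted) pairs for the first rows."""
    predictions = trained_model.predict(test_x)
    # A plain list is indexed 0, 1, 2, ... whatever the pandas index looks like
    actual = list(test_y)
    pairs = list(zip(actual[:count], predictions[:count]))
    for actual_value, predicted_value in pairs:
        print("Actual outcome :: {} and Predicted outcome :: {}".format(
            actual_value, predicted_value))
    return pairs


# Toy data for illustration only: 2 = benign, 4 = malignant
toy_x = pd.DataFrame({"ClumpThickness": [1, 2, 1, 9, 10, 8],
                      "Mitoses": [1, 1, 2, 7, 8, 9]})
toy_y = pd.Series([2, 2, 2, 4, 4, 4])
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_x, toy_y)
pairs = compare_predictions(toy_model, toy_x, toy_y, count=3)
```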

 

It seems the trained classifier predicted the first 5 target classes correctly. To know more about the model, let’s check the train and test accuracy.

Accuracy calculations

To calculate the accuracy we use the scikit-learn accuracy_score method.
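
As a small worked example, accuracy_score compares the actual labels with the predicted ones and returns the fraction that match. The labels here are made up; the commented lines show how the same call would be applied to the train and test data, assuming the trained classifier is stored in a variable called trained_model.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 2 = benign, 4 = malignant
actual = [2, 4, 2, 4]
predicted = [2, 4, 4, 4]

score = accuracy_score(actual, predicted)
print("Accuracy ::", score)  # 3 of the 4 labels match, so 0.75

# Applied to the classifier from this article (names assumed):
# accuracy_score(train_y, trained_model.predict(train_x))  # train accuracy
# accuracy_score(test_y, trained_model.predict(test_x))    # test accuracy
```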

Script output:

Our trained classifier gives 99% accuracy on the train dataset and 97% accuracy on the test dataset.

Confusion matrix

To see the true positive and true negative details, let’s print the confusion matrix of our trained classifier.
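
A small example with made-up labels shows how to read the matrix. Rows are the actual classes and columns are the predicted classes, both in sorted label order (2 then 4). For the classifier in this article the call would be confusion_matrix(test_y, predictions), assuming predictions holds the output of predict on test_x.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 2 = benign, 4 = malignant
actual = [2, 2, 4, 4, 4]
predicted = [2, 4, 4, 4, 2]

matrix = confusion_matrix(actual, predicted)
print(matrix)
# Row 0: actual 2 -> one predicted as 2, one predicted as 4
# Row 1: actual 4 -> one predicted as 2, two predicted as 4
```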

 

Script output:

You can get the complete code below. If you know GitHub, you can clone the complete code from our GitHub account.

Complete Code:

 

Summary

In this article, you learned how to implement the popular random forest classification algorithm in Python using the scikit-learn package.

Along the way, you learned how to handle missing values. Finally, you learned how to calculate the accuracy of a trained classifier using the scikit-learn accuracy_score method.

I hope you like this post. If you have any questions, feel free to comment below. If you want me to write on a particular topic, tell me in the comments below.

11 Responses to “Building Random Forest Classifier with Python Scikit learn”

  • Hello,

    I found an error data_file_to_csv() is not called in main().

  • Bharat Singh
    2 months ago

    Very Nice explanation!
    Real word example is very interesting. It’s give depth knowledge about the RF working.

  • thank you that is really nice

  • I strongly recommend this article who want to know how to write the lines of RF and begin ML. And thank you as one of them.

  • Thanks for your clear and complete article!
    If it is possible, plotting of the results could be helpful.

    • Hi Shon,
      Thanks a lot for your compliment 🙂

      Will try to publish an article on how to visualize the trained random forest classifier.

  • Tanveer Ahmed
    3 weeks ago

    first, thank you for such a detailed explanation on Machine Learning.
    But I am facing error while running the random forest example:
    ###
    Traceback (most recent call last):
    File “”, line 2, in
    main()
    File “”, line 11, in main
    dataset = handel_missing_values(dataset, HEADERS[6], ‘7’)
    NameError: global name ‘HEADERS’ is not defined
    ####
    This error is observed when the below code is executed:
    ###
    def main():
    “””
    Main function
    :return:
    “””
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)
    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], ‘7’)
    ###

    • Hi Tanveer,

      Thanks for your compliment 🙂

      We missed adding the HEADERS list in the code, Which rise the error.
      Updated the article, Have a look.
