Building Random Forest Classifier with Python Scikit learn

June 26, 2017 Saimadhu Polamuri

Random Forest Algorithm in Python

Building Random Forest Algorithm in Python

In the introductory article about the random forest algorithm, we explained how the algorithm works using real-life examples. Continuing from there, in this article we are going to build a random forest classifier in Python with the help of Scikit-Learn, one of the best Python machine learning libraries.

To build the random forest classifier we are going to use the Breast Cancer dataset. In short, we will build a random forest classifier that predicts the breast cancer type (benign or malignant).

Before we begin, let’s quickly look at the table of contents.

Overview of Random forest algorithm
About Breast Cancer
- Benign
- Malignant
UCI breast cancer dataset description
Machine learning workflow
Implementing random forest algorithm in Python
- Creating dataset
- Handling missing values
- Split data into train and test dataset
- Training random forest classifier with scikit learn
Perform predictions
Accuracy calculations
- Train Accuracy
- Test Accuracy
Confusion matrix
Summary
Recommended Data Science Courses

Overview of Random forest algorithm

Random forest is an ensemble classification algorithm. An ensemble classifier is a group of classifiers: instead of using only one classifier to predict the target, we use multiple classifiers and combine their predictions.

In the case of random forest, the classifiers in the ensemble are randomly created decision trees. Each decision tree is a single classifier, and the final prediction is decided by majority voting.

Majority voting works the same way as a political election. Each person votes for one party out of all the parties taking part in the election. In the same way, every classifier votes for one target class out of all the target classes.

To declare the election results, the votes are counted and the party with the most votes is treated as the winner. In the same way, the target class with the most votes is taken as the final predicted class.
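The voting idea described above can be sketched in a few lines of plain Python. This is only an illustration: the function name majority_vote and the example votes are made up for this sketch and are not part of scikit-learn.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    votes = Counter(tree_predictions)
    # most_common(1) returns a one-element list holding the
    # (label, count) pair with the highest count.
    winner, _count = votes.most_common(1)[0]
    return winner


# Hypothetical votes from five decision trees for a single sample
# (2 = benign, 4 = malignant, the codes used by the dataset in this article).
print(majority_vote([2, 4, 2, 2, 4]))
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch shows the concept, not the library internals.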

Before we go further, it’s better to spend some time on the introductory article mentioned above to understand how the random forest algorithm works.

I hope you now have a clear understanding of how the random forest algorithm works, so let’s implement it. As I said earlier, we are going to use the breast cancer dataset to implement the random forest.

Before we begin, let’s look at some statistics and the impact of breast cancer today.

About Breast Cancer

Sadly, in the US breast cancer is the second leading cause of cancer death among women, and an estimated 246,660 cases of breast cancer in women were diagnosed during 2016. Many people believe that every tumor is cancer, but that is a myth.

Only a continuously growing tumor causes death. Based on this property, tumors are mainly of two kinds:

Benign Tumor
Malignant Tumor

Figure: difference between malignant and benign tumors
Image Credit: thetruthaboutcancer.com

Benign

A benign tumor is not a cancerous tumor, which means it cannot spread through the body the way cancerous tumors do. A benign tumor is serious only when it grows in a sensitive place. Tumors of this kind can usually be dealt with through proper treatment and a change in diet habits.

Malignant

A malignant tumor is a cancerous tumor, the kind that causes death. These tumors can grow very fast and spread to various parts of the body.

A good read about these tumors and their prevention can be found in the thetruthaboutcancer article.

UCI breast cancer dataset description

We are using the UCI breast cancer dataset to build the random forest classifier in Python. You can download the data from UCI, or you can download the code from the Dataaspirant GitHub.

This breast cancer dataset is one of the most popular classification datasets. It has 10 attributes (an id number plus 9 features) and 1 target class.

Breast Cancer dataset features:

Sample code number:
- id number
Clump Thickness:
- The values are in the range of 1 – 10
Uniformity of Cell Size:
- The values are in the range of 1 – 10
Uniformity of Cell Shape:
- The values are in the range of 1 – 10
Marginal Adhesion:
- The values are in the range of 1 – 10
Single Epithelial Cell Size:
- The values are in the range of 1 – 10
Bare Nuclei:
- The values are in the range of 1 – 10
Bland Chromatin:
- The values are in the range of 1 – 10
Normal Nucleoli:
- The values are in the range of 1 – 10
Mitoses:
- The values are in the range of 1 – 10

Breast Cancer dataset Target:

The target has two classes:

Benign
- The value will be 2
Malignant
- The value will be 4

This dataset also has missing values. In the coding section of this article, we are going to deal with the missing values before we train the random forest classifier.

Machine learning workflow

Figure: machine learning workflow

Implementing random forest algorithm in Python

To implement the random forest algorithm we are going to follow the two-phase, step-by-step workflow below.

Build Phase
- Creating dataset
- Handling missing values
- Splitting data into train and test datasets
- Training random forest classifier with Python scikit learn
Operational Phase
- Perform predictions
- Accuracy calculations
  - Train Accuracy
  - Test Accuracy

Let’s begin the journey of building the random forest classifier by importing the required Python machine learning packages.

Import required Python machine learning packages

# Required Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

We are going to use the above Python machine learning packages to build the random forest classifier. Let’s talk about why each of them is needed.

Pandas:
- The Pandas package is one of the best choices for tabular data analysis.
- All the data manipulation tasks in this article use Pandas methods.
train_test_split:
- We imported the scikit-learn train_test_split function to split the breast cancer dataset into train and test datasets.
- The train dataset is used in the training phase and the test dataset is used in the validation phase.
RandomForestClassifier:
- We imported the scikit-learn RandomForestClassifier class to fit a random forest classifier on the training dataset.
- Later the trained random forest classifier is used to perform the predictions.
accuracy_score:
- We imported the scikit-learn accuracy_score function to calculate the accuracy of the trained classifier.
confusion_matrix:
- We imported the scikit-learn confusion_matrix function to understand the behavior of the trained classifier on the test (validation) dataset.

Copy the above code into a text file (using your favorite text editor) and save the file with the Python extension (.py), say random_forest.py.

Then run the random_forest.py file from the terminal using the below command.

python random_forest.py

If you installed the Python machine learning packages properly, you won’t face any issues. If the packages are installed properly and you still get the error ImportError: No module named model_selection, it means the scikit-learn package you are using has not been updated to a recent version.

In that case you are most likely using scikit-learn 0.17 or an older version. You can copy and paste the below code to check your scikit-learn version.

import sklearn
print (sklearn.__version__)

If the version you are using is 0.17 or older, you need to update scikit-learn to 0.18 or newer.
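The version comparison above can also be done in code. The helper below is a hypothetical sketch (needs_upgrade is not a scikit-learn function); it only looks at the major and minor numbers of a version string.

```python
import re


def needs_upgrade(version_string, minimum=(0, 18)):
    """Return True if version_string is older than the minimum (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("Unrecognised version: {}".format(version_string))
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (0, 17) < (0, 18) < (1, 0).
    return major_minor < minimum


print(needs_upgrade("0.17.1"))  # an old release, older than 0.18
print(needs_upgrade("0.18"))    # the first release with model_selection
```

In a real script you would pass sklearn.__version__ to the helper instead of a hard-coded string.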

You can use the below commands to update scikit-learn to a newer version (0.18 or later).

Using Pip

pip install -U scikit-learn

Using Anaconda

conda install scikit-learn=0.18

Once you have upgraded the scikit-learn package, run the above code again and you won’t face any issues. If you still face any issue running the above code, please let me know in the comments section.

Now let’s create the dataset to model the random forest classifier.

Creating dataset

The downloaded dataset is a .data file, so we are going to convert it into the CSV format. To do that we are going to write a simple function which first loads the .data file into a pandas dataframe and then saves the loaded dataframe in the CSV file format.

# File Paths
INPUT_PATH = "../inputs/breast-cancer-wisconsin.data"
OUTPUT_PATH = "../inputs/breast-cancer-wisconsin.csv"


def data_file_to_csv():
    """
    Convert the downloaded .data file into a csv file with header names
    :return: None, saves the csv file to OUTPUT_PATH
    """

    # Headers
    headers = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
               "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses",
               "CancerType"]
    # Load the dataset into Pandas data frame
    # (read_data is a small wrapper around pd.read_csv, listed in the complete code)
    dataset = read_data(INPUT_PATH)
    # Add the headers to the loaded dataset
    dataset = add_headers(dataset, headers)
    # Save the loaded dataset into csv format
    dataset.to_csv(OUTPUT_PATH, index=False)
    print("File saved ...!")

INPUT_PATH holds the path of the downloaded .data file and OUTPUT_PATH holds the path where the CSV file is going to be saved.

The read_data helper (listed in the complete code) uses the pandas read_csv method to load the .data file into a pandas dataframe.

The loaded dataset doesn’t have header names, so we need to add them to the loaded dataframe. To do that we have written a function which takes the dataset and the header names as input and adds the header names to the dataset.

def add_headers(dataset, headers):
    """
    Add the headers to the dataset
    :param dataset:
    :param headers:
    :return:
    """
    dataset.columns = headers
    return dataset
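As an alternative sketch (not the approach used in the rest of this article), pandas read_csv can assign the header names while loading, through its header and names parameters, and its na_values parameter can mark a placeholder such as ? as missing. The three sample rows below are illustrative rows written in the dataset's comma-separated layout, loaded from memory instead of from the real file.

```python
import io

import pandas as pd

HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape",
           "MarginalAdhesion", "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin",
           "NormalNucleoli", "Mitoses", "CancerType"]

# Illustrative rows only; the third one uses "?" for a missing BareNuclei value.
SAMPLE = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)

# header=None: the file has no header row; names=...: use our own column names;
# na_values="?": treat the placeholder as a missing value (NaN).
dataset = pd.read_csv(SAMPLE, header=None, names=HEADERS, na_values="?")

print(dataset.shape)
print(dataset["BareNuclei"].isna().sum())
```

With this approach the missing entries become NaN, so they could later be dropped with dropna instead of being filtered by comparing against the ? character.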

After adding the header names to the dataset we save it in the CSV format. While saving the file we pass index=False. Without this parameter the saved file would have an extra column containing the dataframe index, so we pass index=False to leave it out.
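The effect of index=False is easy to see on a tiny dataframe. The two-row frame below is made up for illustration; calling to_csv without a path returns the CSV text as a string, which lets us compare the header lines directly.

```python
import pandas as pd

frame = pd.DataFrame({"ClumpThickness": [5, 3], "CancerType": [2, 4]})

# Without a path, to_csv returns the CSV content as a string.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# The first version starts with an empty, unnamed index column.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```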

Now we are ready with the dataset. The next big thing is preprocessing the data.

Sometimes, if it’s our lucky day, we don’t need to do much work in the preprocessing stage. If not, we need to spend a lot of time on it.

Handling missing values is one such preprocessing task.

Handling missing values

The process of handling missing values differs from dataset to dataset. For the cancer dataset, a simple approach is enough to handle the missing values in the loaded dataset.

Before revealing what those missing values are, I want to show how to identify them, so that when you are working on a different dataset you can identify the missing values by yourself.

The best way to start is to calculate basic statistics for each column (features and target) of the dataset. You may be wondering what the use of these statistics is and how they help to find the missing values.

The idea is that we can use the pandas describe method on the loaded dataset to calculate the basic statistics. By default it reports statistics only for numeric columns, so a column that contains non-numeric placeholders for missing values (or categorical values) is left out of the output.

Feeling lost? No problem. Let’s implement a function to calculate the basic statistics, and then you will get a clear idea of what I am talking about.

def dataset_statistics(dataset):
    """
    Basic statistics of the dataset
    :param dataset: Pandas dataframe
    :return: None, print the basic statistics of the dataset
    """
    print(dataset.describe())

As I said before, we are using the pandas describe method to get the basic statistics of the dataset.

Now let’s call this function and check what it outputs.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

if __name__ == "__main__":
    main()

Script Output

         CodeNumber  ClumpThickness  UniformityCellSize  UniformityCellShape
count  6.980000e+02      698.000000          698.000000           698.000000   
mean   1.071807e+06        4.416905            3.137536             3.210602   
std    6.175323e+05        2.817673            3.052575             2.972867   
min    6.163400e+04        1.000000            1.000000             1.000000   
25%    8.702582e+05        2.000000            1.000000             1.000000   
50%    1.171710e+06        4.000000            1.000000             1.000000   
75%    1.238354e+06        6.000000            5.000000             5.000000   
max    1.345435e+07       10.000000           10.000000            10.000000   

       MarginalAdhesion  SingleEpithelialCellSize  BlandChromatin
count        698.000000                698.000000      698.000000   
mean           2.809456                  3.217765        3.438395   
std            2.856606                  2.215408        2.440056   
min            1.000000                  1.000000        1.000000   
25%            1.000000                  2.000000        2.000000   
50%            1.000000                  2.000000        3.000000   
75%            4.000000                  4.000000        5.000000   
max           10.000000                 10.000000       10.000000   

       NormalNucleoli     Mitoses  CancerType  
count      698.000000  698.000000  698.000000  
mean         2.869628    1.590258    2.690544  
std          3.055004    1.716162    0.951596  
min          1.000000    1.000000    2.000000  
25%          1.000000    1.000000    2.000000  
50%          1.000000    1.000000    2.000000  
75%          4.000000    1.000000    4.000000  
max         10.000000   10.000000    4.000000

If you look closely at the above statistics, you can see that the details of one column are missing. The missing column is BareNuclei.

Now open the CSV file and check the values of the BareNuclei column. You will find the missing values represented with ?, which means we need to treat those values as missing values.
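Instead of opening the CSV file by hand, the same check can be sketched in pandas. The small dataframe below is a made-up stand-in for the real data; pd.to_numeric with errors="coerce" turns every value that is not a number into NaN, which points straight at the placeholder.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset: one numeric column and one
# column that was read as text because it contains a "?" placeholder.
frame = pd.DataFrame({
    "ClumpThickness": [5, 3, 8],
    "BareNuclei": ["1", "?", "10"],
})

# describe() only summarises numeric columns by default,
# so the text column is absent from its output.
print(list(frame.describe().columns))

# Values that cannot be parsed as numbers become NaN with errors="coerce".
as_numbers = pd.to_numeric(frame["BareNuclei"], errors="coerce")
suspicious = frame.loc[as_numbers.isna(), "BareNuclei"].unique()
print(list(suspicious))
```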

Now we know which column has the missing values and which character is used to represent them. Let’s write a simple function which takes the dataset, the header name and the character representing the missing values as input and removes the rows with missing values.

def handel_missing_values(dataset, missing_values_header, missing_label):
    """
    Filter missing values from the dataset
    :param dataset:
    :param missing_values_header:
    :param missing_label:
    :return:
    """

    return dataset[dataset[missing_values_header] != missing_label]
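To see what the one-line filter above does, here is a small sketch on a made-up three-row dataframe. The comparison builds a boolean mask, and indexing the dataframe with that mask keeps only the rows where the mask is True. The final astype(int) conversion is an extra step that is not part of the article's function.

```python
import pandas as pd

frame = pd.DataFrame({
    "BareNuclei": ["1", "?", "10"],
    "CancerType": [2, 2, 4],
})

# True for rows to keep, False for rows holding the "?" placeholder.
mask = frame["BareNuclei"] != "?"
print(mask.tolist())

# Boolean indexing keeps only the rows where the mask is True.
filtered = frame[mask].copy()
print(len(filtered))

# Optional extra step: the remaining values are still strings,
# so convert them to integers if numeric operations are needed.
filtered["BareNuclei"] = filtered["BareNuclei"].astype(int)
print(filtered["BareNuclei"].tolist())
```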

Now let’s call this function inside the main function

def main():
    """
    Main function
    :return:
    """
    # Headers 
    headers = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
               "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses",
               "CancerType"]
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, headers[6], '?')


if __name__ == "__main__":
    main()

Now our dataset is free of missing values. Next, let’s split the data into train and test datasets. The training dataset will be used to train the random forest classifier and the test dataset will be used to validate the trained classifier.

Split data into train and test datasets

To split the data into train and test datasets, let’s write a function which takes the dataset, train percentage, feature header names and target header name as inputs and returns train_x, test_x, train_y and test_y as outputs.

def split_dataset(dataset, train_percentage, feature_headers, target_header):
    """
    Split the dataset with train_percentage
    :param dataset:
    :param train_percentage:
    :param feature_headers:
    :param target_header:
    :return: train_x, test_x, train_y, test_y
    """

    # Split dataset into train and test dataset
    train_x, test_x, train_y, test_y = train_test_split(dataset[feature_headers], dataset[target_header],
                                                        train_size=train_percentage)
    return train_x, test_x, train_y, test_y

Now let’s call the above function inside the main function.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])


if __name__ == "__main__":
    main()

Here HEADERS is the list of all header names, defined at module level (see the complete code). HEADERS[1:-1] contains all the feature header names: we dropped the first and last header names, because the first header is the id and the last header is the target. HEADERS[-1] contains the target header name.

We can print the shape of train_x, test_x, train_y and test_y to check whether the split is proper or not.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Train and Test dataset size details
    print("Train_x Shape :: ", train_x.shape)
    print("Train_y Shape :: ", train_y.shape)
    print("Test_x Shape :: ", test_x.shape)
    print("Test_y Shape :: ", test_y.shape)

if __name__ == "__main__":
    main()

Script output:

Train_x Shape ::  (477, 9)
Train_y Shape ::  (477,)
Test_x Shape ::  (205, 9)
Test_y Shape ::  (205,)

From the above result, it’s clear that the split is proper: the rows are divided between the train and test datasets in roughly a 70/30 ratio. Now let’s build the random forest classifier using the train_x and train_y datasets.
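The shapes printed above can be checked with a little arithmetic. This sketch assumes that 16 rows were removed by the missing-value filter (inferred from 698 - 477 - 205 = 16) and that the training count is rounded down, which matches the output shown; it does not call scikit-learn.

```python
import math

total_rows = 698       # row count reported by describe() earlier
missing_rows = 16      # assumption: inferred from 698 - (477 + 205)
train_fraction = 0.7   # the train_percentage passed to split_dataset

remaining = total_rows - missing_rows
# Assumption: the training count is the floor of fraction * rows,
# and every remaining row goes to the test set.
n_train = math.floor(train_fraction * remaining)
n_test = remaining - n_train

print(n_train, n_test)
```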

Training random forest classifier with scikit learn

To train the random forest classifier we are going to use the below random_forest_classifier function, which takes the features (train_x) and target (train_y) data as inputs and returns the trained random forest classifier as output.

def random_forest_classifier(features, target):
    """
    To train the random forest classifier with features and target data
    :param features:
    :param target:
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf

Now let’s call the above function inside the main function and print the trained classifier.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
if __name__ == "__main__":
    main()

Script Output:

Trained model ::  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
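The printed parameters show the defaults that were used, such as n_estimators (the number of trees) and random_state. As a hedged sketch, both can be set explicitly to control the forest size and make runs repeatable. The data below is synthetic, generated with make_classification as a stand-in for the breast cancer features, and the values 50 and 42 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 9 feature columns, like the dataset in this article.
features, target = make_classification(n_samples=200, n_features=9, random_state=0)

# n_estimators sets how many trees are grown; random_state fixes the randomness.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(features, target)
print(len(clf.estimators_))

# Training again with the same random_state gives the same predictions.
repeat = RandomForestClassifier(n_estimators=50, random_state=42).fit(features, target)
same = bool((clf.predict(features) == repeat.predict(features)).all())
print(same)
```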

Perform predictions

Now that we have trained the classifier, let’s take a few observations from the test dataset and print what our model predicts alongside the actual target.

To do that, let’s first predict the target for all the test features (test_x) using the trained classifier. Then we will compare what our trained model predicts with the actual outcome.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    for i in range(0, 5):
        print("Actual outcome :: {} and Predicted outcome :: {}".format(list(test_y)[i], predictions[i]))
if __name__ == "__main__":
    main()

First I converted test_y from a pandas Series into a list object. The reason is that we randomly split the train and test datasets, so the index labels of test_y won’t be in order. Once we convert the Series into a list object, the values can be accessed by position, starting from 0.

The above code prints the first 5 values of test_y and the corresponding predicted targets.

Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 2 and Predicted outcome :: 2
Actual outcome :: 4 and Predicted outcome :: 4

It seems the trained classifier predicted the first 5 target classes correctly. To know more about the model, let’s check the train and test accuracy.
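The reason for converting test_y to a list before indexing can be shown on a small made-up Series. After a random split, the index labels are the original row numbers in shuffled order, so label 0 may not be present at all, while positional access still works.

```python
import pandas as pd

# Made-up stand-in for test_y: the index holds shuffled original row labels.
test_y = pd.Series([2, 4, 2], index=[412, 7, 95])

# Converting to a list gives plain positional access.
print(list(test_y)[0])

# .iloc is the pandas way to access by position without converting.
print(test_y.iloc[0])

# Label 0 is not in the index, so a label-based lookup would fail.
print(0 in test_y.index)
```

Using test_y.iloc[i] inside the loop would avoid rebuilding the list on every iteration.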

Accuracy calculations

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    # Train and Test Accuracy
    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))

if __name__ == "__main__":
    main()

To calculate the accuracy we are using the scikit-learn accuracy_score method.

Script output:

Train Accuracy ::  0.991614255765
Test Accuracy  ::  0.970731707317

Our trained classifier gives 99% accuracy on the train dataset and 97% accuracy on the test dataset.

Confusion matrix

To see the true positive and true negative details, let’s print the confusion matrix of our trained classifier.

def main():
    """
    Main function
    :return:
    """
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    predictions = trained_model.predict(test_x)

    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    print(" Confusion matrix ", confusion_matrix(test_y, predictions))


if __name__ == "__main__":
    main()

Script output:

Train Accuracy ::  0.991614255765
Test Accuracy  ::  0.970731707317
 Confusion matrix  [[123   5]
 [  1  76]]
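The confusion matrix printed above can be read with a short sketch. In scikit-learn's confusion_matrix the rows are the actual classes and the columns are the predicted classes, in sorted label order, so here the first row and column stand for 2 (benign) and the second for 4 (malignant). The variable names are chosen for this illustration.

```python
# Confusion matrix copied from the script output above.
matrix = [[123, 5],
          [1, 76]]

# Row 0: actual benign, row 1: actual malignant.
# Column 0: predicted benign, column 1: predicted malignant.
(true_benign, false_malignant), (false_benign, true_malignant) = matrix

total = sum(sum(row) for row in matrix)
# Accuracy is the share of samples on the diagonal.
accuracy = (true_benign + true_malignant) / total
# Share of actual malignant samples that were predicted as malignant.
malignant_recall = true_malignant / (false_benign + true_malignant)

print(total)
print(round(accuracy, 6))
print(round(malignant_recall, 6))
```

The diagonal adds up to 199 correct predictions out of 205 test samples, which is the test accuracy reported above.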

You can get the complete code below. If you know GitHub, you can also clone the complete code from our GitHub account.

Complete Code:

# Required Python Packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# File Paths
INPUT_PATH = "../inputs/breast-cancer-wisconsin.data"
OUTPUT_PATH = "../inputs/breast-cancer-wisconsin.csv"

# Headers
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]


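def read_data(path):
    """
    Read the data into pandas dataframe
    :param path: path of the raw .data file
    :return: pandas dataframe
    """
    # The raw .data file has no header row. header=None stops pandas from
    # using the first sample as column names (which would leave 698 rows
    # instead of the dataset's 699).
    data = pd.read_csv(path, header=None)
    return data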


def get_headers(dataset):
    """
    dataset headers
    :param dataset:
    :return:
    """
    return dataset.columns.values


def add_headers(dataset, headers):
    """
    Add the headers to the dataset
    :param dataset:
    :param headers:
    :return:
    """
    dataset.columns = headers
    return dataset


def data_file_to_csv():
    """
    Convert the downloaded .data file into a csv file with header names
    :return: None, saves the csv file to OUTPUT_PATH
    """

    # Headers
    headers = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
               "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses",
               "CancerType"]
    # Load the dataset into Pandas data frame
    dataset = read_data(INPUT_PATH)
    # Add the headers to the loaded dataset
    dataset = add_headers(dataset, headers)
    # Save the loaded dataset into csv format
    dataset.to_csv(OUTPUT_PATH, index=False)
    print("File saved ...!")


def split_dataset(dataset, train_percentage, feature_headers, target_header):
    """
    Split the dataset with train_percentage
    :param dataset:
    :param train_percentage:
    :param feature_headers:
    :param target_header:
    :return: train_x, test_x, train_y, test_y
    """

    # Split dataset into train and test dataset
    train_x, test_x, train_y, test_y = train_test_split(dataset[feature_headers], dataset[target_header],
                                                        train_size=train_percentage)
    return train_x, test_x, train_y, test_y


def handel_missing_values(dataset, missing_values_header, missing_label):
    """
    Filter missing values from the dataset
    :param dataset:
    :param missing_values_header:
    :param missing_label:
    :return:
    """

    return dataset[dataset[missing_values_header] != missing_label]


def random_forest_classifier(features, target):
    """
    To train the random forest classifier with features and target data
    :param features:
    :param target:
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier()
    clf.fit(features, target)
    return clf


def dataset_statistics(dataset):
    """
    Basic statistics of the dataset
    :param dataset: Pandas dataframe
    :return: None, print the basic statistics of the dataset
    """
    print(dataset.describe())


def main():
    """
    Main function
    :return:
    """
    # Convert the downloaded .data file into a csv file with headers
    data_file_to_csv()
    # Load the csv file into pandas dataframe
    dataset = pd.read_csv(OUTPUT_PATH)
    # Get basic statistics of the loaded dataset
    dataset_statistics(dataset)

    # Filter missing values
    dataset = handel_missing_values(dataset, HEADERS[6], '?')
    train_x, test_x, train_y, test_y = split_dataset(dataset, 0.7, HEADERS[1:-1], HEADERS[-1])

    # Train and Test dataset size details
    print("Train_x Shape :: ", train_x.shape)
    print("Train_y Shape :: ", train_y.shape)
    print("Test_x Shape :: ", test_x.shape)
    print("Test_y Shape :: ", test_y.shape)

    # Create random forest classifier instance
    trained_model = random_forest_classifier(train_x, train_y)
    print("Trained model :: ", trained_model)
    predictions = trained_model.predict(test_x)

    for i in range(0, 5):
        print("Actual outcome :: {} and Predicted outcome :: {}".format(list(test_y)[i], predictions[i]))

    print("Train Accuracy :: ", accuracy_score(train_y, trained_model.predict(train_x)))
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    print(" Confusion matrix ", confusion_matrix(test_y, predictions))


if __name__ == "__main__":
    main()

Summary

In this article, you learned how to implement random forest, one of the most popular classification algorithms, in Python using the scikit-learn package.

Along the way, you learned how to handle missing values. Finally, you learned how to calculate the accuracy of a trained classifier using the scikit-learn accuracy_score method.

Recommended Data Science Courses

30 Responses to “Building Random Forest Classifier with Python Scikit learn”

Sagar Kalra
5 years ago

You are my love bro :*
- Saimadhu Polamuri
  4 years ago
  
  Hi Sagar Kalra,
  
  Thanks for the compliment.
  
  We wish you a very happy learning.
Manjunath
6 years ago

KEY ERROR :BareNuclei
- Saimadhu Polamuri
  4 years ago
  
  Hi Manjunath,
  
  The key you need to consider is BareNuclei not :BareNuclei. Could you please try this key and let me know if you still face the issue.
Manjunath
6 years ago

File b’../inputs/breast-cancer-wisconsin.csv’ does not exist
- Saimadhu Polamuri
  4 years ago
  
  Hi Manjunath,
  
  You can download the data from the UCI breast cancer dataset. I am sharing the link.
  
  Link: https://archive.ics.uci.edu/ml/datasets/breast+cancer
  
  Thanks and happy learning!
Manjunath
6 years ago

HI sir, I am facing issues like unexpected indent please help me
- Saimadhu Polamuri
  4 years ago
  
  Hi Manjunath,
  
  When you copy the code in the article, Please check the indentation is properly followed in the code editor you are using, You can compare the code in the article and in your editor. I hope this will resolve issue. If not let me know.
  
  Thanks and happy learning!
Sasha
6 years ago

If you wold have added the data set (the csv file) it would have been great.
- Saimadhu Polamuri
  4 years ago
  
  Hi Sasha,
  
  You can download the data from the UCI breast cancer dataset. I am sharing the link.
  
  Link: https://archive.ics.uci.edu/ml/datasets/breast+cancer
  
  Thanks and happy learning!
Xiaoping
6 years ago

Can I specify max-number of sub-tree in the random forest modelling method?
- Saimadhu Polamuri
  4 years ago
  
  Hi Xiaoping,
  
  You can try out the best max number of sub-tree for random forest using the grid parameter search instead of manually trying different max-number values.
  
  Thanks and happy learning.
lol
6 years ago

Great article, thanks!
- Saimadhu Polamuri
  6 years ago
  
  Hi Lol,
  Thanks for your compliment 🙂
Shubhangi
6 years ago

Hello,
I am still getting issue with importing the libraries. Please help.
- Saimadhu Polamuri
  6 years ago
  
  Hi Shubhangi,
  
  It looks like the Python machine learning packages are not installed properly. Please follow the article on how to create the machine learning Python environment.
Sakthi
6 years ago

How exactly does the handle missing values function work in the above code? I’m new to pandas
- Saimadhu Polamuri
  6 years ago
  
  Hi Sakthi,
  
  The function in the article which handles the missing values is pretty simple one. From the data itself, we know that having “?” represents the missing observation. So in the handling missing values function, we are just checking if the observation is having “?” as value then we are not considering those observations.
Tanveer Ahmed
7 years ago

first, thank you for such a detailed explanation on Machine Learning.
But I am facing error while running the random forest example:
###
Traceback (most recent call last):
File “”, line 2, in
main()
File “”, line 11, in main
dataset = handel_missing_values(dataset, HEADERS[6], ‘7’)
NameError: global name ‘HEADERS’ is not defined
####
This error is observed when the below code is executed:
###
def main():
“””
Main function
:return:
“””
# Load the csv file into pandas dataframe
dataset = pd.read_csv(OUTPUT_PATH)
# Get basic statistics of the loaded dataset
dataset_statistics(dataset)
# Filter missing values
dataset = handel_missing_values(dataset, HEADERS[6], ‘7’)
###
- Saimadhu Polamuri
  7 years ago
  
  Hi Tanveer,
  
  Thanks for your compliment 🙂
  
  We missed adding the HEADERS list in the code, Which rise the error.
  Updated the article, Have a look.
Shon
7 years ago

Thanks for your clear and complete article!
If it is possible, plotting of the results could be helpful.
- Saimadhu Polamuri
  7 years ago
  
  Hi Shon,
  Thanks a lot for your compliment 🙂
  
  Will try to publish an article on how to visualize the trained random forest classifier.
Shon
7 years ago

I strongly recommend this article who want to know how to write the lines of RF and begin ML. And thank you as one of them.
- Saimadhu Polamuri
  7 years ago
  
  Hi Shon,
  
  Thanks a lot for your compliment 🙂 .
rebeen
7 years ago

thank you that is really nice
- Saimadhu Polamuri
  7 years ago
  
  Hi Rebeen,
  
  Thanks for your compliment. 🙂
Bharat Singh
7 years ago

Very Nice explanation!
Real word example is very interesting. It’s give depth knowledge about the RF working.
- Saimadhu Polamuri
  7 years ago
  
  Hi Bharat Singh,
  Thanks for your compliment. 🙂
Andrew
7 years ago

Hello,

I found an error data_file_to_csv() is not called in main().