How the random forest algorithm works in machine learning

May 22, 2017 Saimadhu Polamuri

Random Forest Introduction

Introduction to Random Forest Algorithm

In this article, you are going to learn the most popular classification algorithm. Which is the random forest algorithm. In machine learning way fo saying the random forest classifier. As a motivation to go further I am going to give you one of the best advantages of random forest.

Random forest algorithm can use both for classification and the regression kind of problems.

The Same algorithm both for classification and regression, You mind be thinking I am kidding. But the truth is, Yes we can use the same random forest algorithm both for classification and regression.

Excited, I do have the same feeling when I first heard about the advantage of the random forest algorithm. Which is the same algorithm can use for both regression and classification problems.

In this article, you are going to learn, how the random forest algorithm works in machine learning for the classification task. In the next coming article, you can learn about how the random forest algorithm can use for regression.

Get a cup of coffee before you begin, As this is going to be a long article 😛

We begin with the table of contents.

How random forest classifier works for classification. Click To Tweet

What is the Random Forest algorithm?
- Decision tree concepts
Why random forest algorithm
Random forest algorithm real-life example
- Decision Tree Example
- Random forest Example
How Random forest algorithm works
- Pseudocode Random forest creation
- Pseudocode for performing predictions
Random forest algorithm Applications
- Banking
- Medicine
- Stock Market
- E-Commerce
Advantages of the random forest algorithm

What is Random Forest algorithm?

The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates the forest with a number of trees.

In general, the more trees in the forest the more robust the forest looks like. In the same way in the random forest classifier, the higher the number of trees in the forest gives the high the accuracy results.

If you know the decision tree algorithm. You might be thinking are we creating more number of decision trees and how can we create more number of decision trees. As all the calculation of nodes selection will be the same for the same dataset.

Yes. You are true. To model more number of decision trees to create the forest you are not going to use the same apache of constructing the decision with information gain or Gini index approach.

If you are not aware of the concepts of decision tree classifier, Please spend some time on the below articles, As you need to know how the decision tree classifier works before you learning the working nature of the random forest algorithm. If you would like to learn the implementation of the decision tree classifier, you can check it out from the below articles.

If you are new to the concept of the decision tree. I am giving you a basic overview of the decision tree.

Basic decision tree concept

The decision tree concept is more to the rule-based system. Given the training dataset with targets and features, the decision tree algorithm will come up with some set of rules. The same set rules can be used to perform the prediction on the test dataset.

Suppose you would like to predict that your daughter will like the newly released animation movie or not. To model the decision tree you will use the training dataset like the animated cartoon characters your daughter liked in the past movies.

So once you pass the dataset with the target as your daughter will like the movie or not to the decision tree classifier. The decision tree will start building the rules with the characters your daughter likes as nodes and the targets like or not as the leaf nodes. By considering the path from the root node to the leaf node. You can get the rules.

The simple rule could be if some x character is playing the leading role then your daughter will like the movie. You can think a few more rules based on this example.

Then to predict whether your daughter will like the movie or not. You just need to check the rules which are created by the decision tree to predict whether your daughter will like the newly released movie or not.

In decision tree algorithm calculating these nodes and forming the rules will happen using the information gain and Gini index calculations.

In a random forest algorithm, Instead of using information gain or Gini index for calculating the root node, the process of finding the root node and splitting the feature nodes will happen randomly. Will look about in detail in the coming section.

Next, you are going to learn why random forest algorithm? When we are having other classification algorithms to play with.

Why Random forest algorithm

To address why random forest algorithm. I am giving you the below advantages.

The same random forest algorithm or the random forest classifier can use for both classification and the regression task.
Random forest classifier will handle the missing values.
When we have more trees in the forest, a random forest classifier won’t overfit the model.
Can model the random forest classifier for categorical values also.

Will discuss this advantage in the random forest algorithm advantages section of this article. Until think through the above advantages of the random forest algorithm compared to the other classification algorithms.

Random forest algorithm real-life example

Random Forest Example

Before you drive into the technical details about the random forest algorithm. Let’s look into a real-life example to understand the layman type of random forest algorithm.

Suppose Mady somehow got 2 weeks’ leave from his office. He wants to spend his 2 weeks traveling to a different place. He also wants to go to the place he may like.

So he decided to ask his best friend about the places he may like. Then his friend started asking about his past trips. It’s just like his best friend will ask, You have been visited the X place did you like it?

Based on the answers which are given by Mady, his best start recommending the place Mady may like. Here his best formed the decision tree with the answer given by Mady.

As his best friend may recommend his best place to Mady as a friend. The model will be biased with the closeness of their friendship. So he decided to ask a few more friends to recommend the best place he may like.

Now his friends asked some random questions and each one recommended one place to Mady. Now Mady considered the place which is high votes from his friends as the final place to visit.

In the above Mady trip planning, two main interesting algorithms decision tree algorithm and random forest algorithm used. I hope you find it already. Anyhow, I would like to highlight it again.

Decision Tree:

To recommend the best place for Mady, his best friend asked some questions. Based on the answers given by mady, he recommended a place. This is decision tree algorithm approach. Will explain why it is a decision tree algorithm approach.

Mady friend used the answers given by mady to create rules. Later he used the created rules to recommend the best place which mady will like. These rules could be, mady like a place with lots of tree or waterfalls ..etc

In the above approach mady best friend is the decision tree. The vote (recommended place) is the leaf of the decision tree (Target class). The target is finalized by a single person, In a technical way of saying, using only a single decision tree.

Random Forest Algorithm:

In the other case when mady asked his friends to recommend the best place to visit. Each friend asked him different questions and come up with their recommend a place to visit. Later mady consider all the recommendations and calculated the votes. Votes basically are to pick the popular place from the recommend places from all his friends.

Mady will consider each recommended place and if the same place recommended by some other place he will increase the count. In the end, the high count place where mady will go.

In this case, the recommended place (Target Prediction) is considered by many friends. Each friend is the tree and the combined all friends will form the forest. This forest is random forest. As each friend asked random questions to recommend the best place visit.

Now let’s use the above example to understand how the random forest algorithm work.

How Random forest algorithm works

How random forest algorithm works

Let’s look at the pseudocode for random forest algorithm and later we can walk through each step in the random forest algorithm.

The pseudocode for random forest algorithms can split into two stages.

Random forest creation pseudocode.
Pseudocode to perform prediction from the created random forest classifier.

First, let’s begin with random forest creation pseudocode

Random Forest pseudocode:

Randomly select “k” features from total “m” features.
1. Where k << m
Among the “k” features, calculate the node “d” using the best split point.
Split the node into daughter nodes using the best split.
Repeat 1 to 3 steps until “l” number of nodes has been reached.
Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees.

The beginning of random forest algorithm starts with randomly selecting “k” features out of total “m” features. In the image, you can observe that we are randomly taking features and observations.

In the next stage, we are using the randomly selected “k” features to find the root node by using the best split approach.

In the next stage, We will be calculating the daughter nodes using the same best split approach. Will the first 3 stages until we form the tree with a root node and having the target as the leaf node.

Finally, we repeat 1 to 4 stages to create “n” randomly created trees. These randomly created trees form the random forest.

Random forest prediction pseudocode:

To perform prediction using the trained random forest algorithm uses the below pseudocode.

Takes the test features and use the rules of each randomly created decision tree to predict the oucome and stores the predicted outcome (target)
Calculate the votes for each predicted target.
Consider the high voted predicted target as the final prediction from the random forest algorithm.

To perform the prediction using the trained random forest algorithm we need to pass the test features through the rules of each randomly created trees. Suppose let’s say we formed 100 random decision trees to from the random forest.

Each random forest will predict a different targets (outcomes) for the same test feature. Then by considering each predicted target votes will be calculated. Suppose the 100 random decision trees are prediction some 3 unique targets x, y, z then the votes of x is nothing but out of 100 random decision tree how many trees prediction is x.

Likewise for the other 2 targets (y, z). If x is getting high votes. Let’s say out of 100 random decision tree 60 trees are predicting the target will be x. Then the final random forest returns the x as the predicted target.

This concept of voting is known as majority voting.

Now let’s look into a few applications of random forest algorithm.

Random forest algorithm applications

Random Forest Applications

The random algorithm used in wide varieties of applications. In this article, we are going to address a few of them.

Below are some the application where the random forest algorithm is widely used.

Banking
Medicine
Stock Market
E-commerce

Let’s begin with the banking sector.

1.Banking:

In the banking sector, a random forest algorithm widely used in two main applications. These are for finding loyal customers and finding fraud customers.

The loyal customer means not the customer who pays well, but also the customer who can take the huge amount as loan and pays the loan interest properly to the bank. As the growth of the bank purely depends on loyal customers. The bank customer’s data highly analyzed to find the pattern for the loyal customer based on the customer details.

In the same way, there is a need to identify the customer who is not profitable for the bank, like taking the loan and paying the loan interest properly or find outlier customers. If the bank can identify theses kind of customer before giving the loan the customer. Bank will get a chance to not approve the loan to these kinds of customers. In this case, also random forest algorithm is used to identify the customers who are not profitable for the bank.

2.Medicine

In the medicine field, a random forest algorithm is used to identify the correct combination of the components to validate the medicine. Random forest algorithm also helpful for identifying the disease by analyzing the patient’s medical records.

3.Stock Market

In the stock market, a random forest algorithm used to identify the stock behavior as well as the expected loss or profit by purchasing the particular stock.

4.E-commerce

In e-commerce, the random forest used only in the small segment of the recommendation engine for identifying the likely hood of customers liking the recommend products base on similar kinds of customers.

Running a random forest algorithm on a very large dataset requires high-end GPU systems. If you are not having any GPU system. You can always run the machine learning models in cloud-hosted desktop. You can use clouddesktoponline platform to run high-end machine learning models from sitting any corner of the world.

Advantages of random forest algorithm

Below are the advantages of random forest algorithm compared with other classification algorithms.

The overfitting problem will never come when we use the random forest algorithm in any classification problem.
The same random forest algorithm can be used for both classification and regression tasks.
The random forest algorithm can be used for feature engineering.
- This means identifying the most important features out of the available features from the training dataset.

Related Data Science Courses

47 Responses to “How the random forest algorithm works in machine learning”

shaymaa
2 years ago
Reply

hi i want random forest feature selection code in python and plot feature selection
Ajay Sharma
4 years ago
Reply

One of the best blogs that i have read still now. Thanks for your contribution in sharing such a useful information. Waiting for your further updates.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Thanks Ajay Sharma 🙂
Ayush Maheshwari
4 years ago
Reply

please explain how random forest identify this is the most important features out of the available features from the training dataset.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Ayush Maheshwari,
  
  To identify a feature is important in the random forest we can calculating different methods, like information gain, Gini index. Once we calculated these methods score for all available features, the model will pick the best score feature at each root node.
  
  Thanks and happy learning!
Hafiza Iqra Naz
5 years ago
Reply

i want to know specifically about decision tree& random forest nd also have some questions in mind
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Hafiza Iqra Naz,
  
  You can check the below articles to learn more about the decision tree & random forest.
  
  Links:
  https://dataaspirant.com/how-decision-tree-algorithm-works/
  https://dataaspirant.com/decision-tree-algorithm-python-with-scikit-learn/
  https://dataaspirant.com/random-forest-algorithm-machine-learing/
  https://dataaspirant.com/random-forest-classifier-python-scikit-learn/
  
  Let me know if you need any other sources for the same topics.
  
  Happy learning!
Soumyabrata Bhattacharjee
5 years ago
Reply

The word “excellent“ is not enough for this kind of explanation.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Soumyabrata,
  
  Thanks for the compliment.
  
  We wish you a very happy learning.
Aaron Evans
5 years ago
Reply

Great real world example. Different decision trees with random sampling and accumulated total.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Aaron Evans,
  
  Thanks for the compliment.
  
  We wish you a very happy learning.
kiran
5 years ago
Reply

Very Simplified. Created interest to dig deper into another concepts
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi kiran,
  
  Thanks for the compliment.
  
  Happy learning.
Surya
5 years ago
Reply

Hi,
Thanks for the article.. Can you also explain what is out of bag error in random forest
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Surya,
  
  Thanks for your compliment. When we are having very limited data for training the model, further it’s hard to split the data into train and test, So in such a case, it’s always best to use the bag of error method. In the bag of error case for the randomly created tree in the forest, we create a validation dataset. The data points in the validation dataset will never be included in training that particular tree. In scikit learn you can turn on the oob_score true to use this method.
  
  Thanks and happy learning!
Boris
5 years ago
Reply

Full of typos & language errors and very hard to read… I though it is supposed to be proofread before publishing, no?
I still appreciate the material, but the way it is given is a bit disappointing.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Boris,
  
  Thanks a lot for the feedback. We have updated the article. Please have a look.
  
  Thanks and happy learning!
Daniel
5 years ago
Reply

Awesome explanation! It would be super cool, if you could write another explanation about gradient boosting and gradient boosting compared to random forests.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Daniel,
  
  Thanks for your suggestion, we will incorporate in the article.
  
  Thanks and happy learning!
Anand
6 years ago
Reply

Great explanation. There are many grammatical errors. Please remove them to make the reading all the more enjoyable. Thanks.
- Saimadhu Polamuri
  4 years ago
  Reply
  
  Hi Anand,
  
  Thanks for the suggestion we will try to minimize the grammatical errors.
  
  We wish you a very happy learning.
Bruce
6 years ago
Reply

I can´t really say if I liked the article cause I found it quite hard to understand, not due to the concepts but to the poor writing. It seems as though you wrote the words as they came to your mind, with no further revision.

I strongly recommend a rewriting.
- Saimadhu Polamuri
  6 years ago
  Reply
  
  Hi Bruce,
  
  Thanks for your suggestion. Will do other revision now.
- Abhishek Kumar
  5 years ago
  Reply
  
  HI , I want to say that this post is awesome thanks for sharing this with us . Can you please tell something about python too
  - Saimadhu Polamuri
    4 years ago
    Reply
    
    Hi Abhishek,
    
    Thanks for the compliment. We are planning for the python tutorials series, will go live soon.
    
    Happy learning.
Ash
6 years ago
Reply

You’re doing a brilliant work by writing out such posts, nobody can complain in understanding them.
However, I have a doubt when you mentioned selecting ‘k’ features out of ‘m’ features. Are the features not formed along columns? If so, then you have 5 features in every random tree.
Can you please explain?
- Saimadhu Polamuri
  6 years ago
  Reply
  
  Hi Ash,
  
  Thanks for you compliment 🙂
  
  Selecting ‘k’ features out of ‘m’ is basically selecting few columns for all the columns of the data we have. In the example I have taken 5 in all the random trees, In general this number will vary all the time.
  - Ash
    6 years ago
    Reply
    
    Gotcha! Thanks for answering. So, how could we ensure that all the features are represented if they’re selected randomly?
    - Saimadhu Polamuri
      6 years ago
      Reply
      
      That’s what random forest algorithm does.
gunjan dogra
6 years ago
Reply

Content is good, but not visible due to wrong colour scheme.
- Saimadhu Polamuri
  6 years ago
  Reply
  
  Hi Gunjan Dogra,
  
  Thanks for your compliment 🙂 . We were glad for your suggestion, As for our brand, it’s hard for us to change the color scheme. Hope you understand.
TARUN VOORA
6 years ago
Reply

Very Nice Bro.. Thanks for your articles. Learned a lot from those and by the way we are from same college 🙂
- Saimadhu Polamuri
  6 years ago
  Reply
  
  Hi Tarun Voora,
  
  Thanks for your compliment 🙂
rahul
7 years ago
Reply

Thank you, this is one of this best site I ever visited, very good explanation easy to understand.
Just I want to know how random forest or decision tree create/set rules to get a prediction.
for example :
label features prediction
9 [9.0,3.0,1.2274] 8.753939659
6 [6.0,3.0,1.2274] 6.392596619
6 [6.0,2.84,0.89] 6.114090909
5 [5.0,3.0,1.2274] 5.902340209
8 [8.0,2.775,1.2593] 7.758650794
please c an you explain me how it will go from root node respective this example it will be more helpful.
Thank you in advance.
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Rahul,
  
  Thanks a lot for your compliment 🙂
  
  You can have a look at how to visualize the trained decision tree article. In this article, we have explained how the trained decision tree classifier uses the trained rules to make the prediction.
Claude-A. Joseph
7 years ago
Reply

Hi, Saimadhu Polamuri,
Thank you, u really explained it well.
i would like to learn about How random forest classifier works for classification. Have you already published this article ?
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Calude,
  
  Thanks for your compliment. We published an article explaining how to use the random forest classifier to perform classification task. Do have a look. https://dataaspirant.com/2017/06/26/random-forest-classifier-python-scikit-learn/
  - Joseph Claude-A
    7 years ago
    Reply
    
    Thank you for your answer and I am sorry. I would like talking about how random forest works for the regression.
    Thank
    - Saimadhu Polamuri
      7 years ago
      Reply
      
      Hi Calude,
      You can use the scikit learn RandomForestRegressor to model regression models. Will try to write an article on how to use the random forest algorithm for regression models.
Anagha Tamhankar
7 years ago
Reply

Hi Saimadhu, U really explained it well….I just want to understand the backhand process of the decision tree. Where can I find that?
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Anagha Tamhankar,
  
  Thanks for your compliment 🙂
  We have written an article on how the decision tree algorithm works. You can have a look.
  
  https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Jisha
7 years ago
Reply

Well Explained concepts. Thanks for your time in writing the Article.
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Jisha,
  Thanks for your compliment. 🙂
Marcian Fernando
7 years ago
Reply

Random forest explained well. Great Job!
- Saimadhu Polamuri
  7 years ago
  Reply
  
  Hi Marcian Fernando,
  Thanks for your compliment. 🙂
  - Mamta
    5 years ago
    Reply
    
    Hi, can you please tell me when random forest buulds tree , does it pick up only random attributes or also random tuples ?
    - Saimadhu Polamuri
      4 years ago
      Reply
      
      Hi Mamta,
      
      The random forest tree build considers random features and random observations to build each tree.
      
      Thanks and happy learning!