How the random forest algorithm works in machine learning
Introduction to Random Forest Algorithm
In this article, you are going to learn the most popular classification algorithm. Which is the random forest algorithm. In machine learning way fo saying the random forest classifier. As a motivation to go further I am going to give you one of the best advantages of random forest.
Random forest algorithm can use both for classification and the regression kind of problems.
The Same algorithm both for classification and regression, You mind be thinking I am kidding. But the truth is, Yes we can use the same random forest algorithm both for classification and regression.
Excited, I do have the same feeling when I first heard about the advantage of the random forest algorithm. Which is the same algorithm can use for both regression and classification problems.
In this article, you are going to learn, how the random forest algorithm works in machine learning for the classification task. In the next coming article, you can learn about how the random forest algorithm can use for regression.
Get a cup of coffee before you begin, As this is going to be a long article 😛
We begin with the table of contents.
How random forest classifier works for classification. Share on X
Table of Contents:
- What is the Random Forest algorithm?
- Decision tree concepts
- Why random forest algorithm
- Random forest algorithm real-life example
- Decision Tree Example
- Random forest Example
- How Random forest algorithm works
- Pseudocode Random forest creation
- Pseudocode for performing predictions
- Random forest algorithm Applications
- Banking
- Medicine
- Stock Market
- E-Commerce
- Advantages of the random forest algorithm
What is Random Forest algorithm?
The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates the forest with a number of trees.
In general, the more trees in the forest the more robust the forest looks like. In the same way in the random forest classifier, the higher the number of trees in the forest gives the high the accuracy results.
If you know the decision tree algorithm. You might be thinking are we creating more number of decision trees and how can we create more number of decision trees. As all the calculation of nodes selection will be the same for the same dataset.
Yes. You are true. To model more number of decision trees to create the forest you are not going to use the same apache of constructing the decision with information gain or Gini index approach.
If you are not aware of the concepts of decision tree classifier, Please spend some time on the below articles, As you need to know how the decision tree classifier works before you learning the working nature of the random forest algorithm. If you would like to learn the implementation of the decision tree classifier, you can check it out from the below articles.
If you are new to the concept of the decision tree. I am giving you a basic overview of the decision tree.
Basic decision tree concept
The decision tree concept is more to the rule-based system. Given the training dataset with targets and features, the decision tree algorithm will come up with some set of rules. The same set rules can be used to perform the prediction on the test dataset.
Suppose you would like to predict that your daughter will like the newly released animation movie or not. To model the decision tree you will use the training dataset like the animated cartoon characters your daughter liked in the past movies.
So once you pass the dataset with the target as your daughter will like the movie or not to the decision tree classifier. The decision tree will start building the rules with the characters your daughter likes as nodes and the targets like or not as the leaf nodes. By considering the path from the root node to the leaf node. You can get the rules.
The simple rule could be if some x character is playing the leading role then your daughter will like the movie. You can think a few more rules based on this example.
Then to predict whether your daughter will like the movie or not. You just need to check the rules which are created by the decision tree to predict whether your daughter will like the newly released movie or not.
In decision tree algorithm calculating these nodes and forming the rules will happen using the information gain and Gini index calculations.
In a random forest algorithm, Instead of using information gain or Gini index for calculating the root node, the process of finding the root node and splitting the feature nodes will happen randomly. Will look about in detail in the coming section.
Next, you are going to learn why random forest algorithm? When we are having other classification algorithms to play with.
Why Random forest algorithm
To address why random forest algorithm. I am giving you the below advantages.
- The same random forest algorithm or the random forest classifier can use for both classification and the regression task.
- Random forest classifier will handle the missing values.
- When we have more trees in the forest, a random forest classifier won’t overfit the model.
- Can model the random forest classifier for categorical values also.
Will discuss this advantage in the random forest algorithm advantages section of this article. Until think through the above advantages of the random forest algorithm compared to the other classification algorithms.
Random forest algorithm real-life example
Before you drive into the technical details about the random forest algorithm. Let’s look into a real-life example to understand the layman type of random forest algorithm.
Suppose Mady somehow got 2 weeks’ leave from his office. He wants to spend his 2 weeks traveling to a different place. He also wants to go to the place he may like.
So he decided to ask his best friend about the places he may like. Then his friend started asking about his past trips. It’s just like his best friend will ask, You have been visited the X place did you like it?
Based on the answers which are given by Mady, his best start recommending the place Mady may like. Here his best formed the decision tree with the answer given by Mady.
As his best friend may recommend his best place to Mady as a friend. The model will be biased with the closeness of their friendship. So he decided to ask a few more friends to recommend the best place he may like.
Now his friends asked some random questions and each one recommended one place to Mady. Now Mady considered the place which is high votes from his friends as the final place to visit.
In the above Mady trip planning, two main interesting algorithms decision tree algorithm and random forest algorithm used. I hope you find it already. Anyhow, I would like to highlight it again.
Decision Tree:
To recommend the best place for Mady, his best friend asked some questions. Based on the answers given by mady, he recommended a place. This is decision tree algorithm approach. Will explain why it is a decision tree algorithm approach.
Mady friend used the answers given by mady to create rules. Later he used the created rules to recommend the best place which mady will like. These rules could be, mady like a place with lots of tree or waterfalls ..etc
In the above approach mady best friend is the decision tree. The vote (recommended place) is the leaf of the decision tree (Target class). The target is finalized by a single person, In a technical way of saying, using only a single decision tree.
Random Forest Algorithm:
In the other case when mady asked his friends to recommend the best place to visit. Each friend asked him different questions and come up with their recommend a place to visit. Later mady consider all the recommendations and calculated the votes. Votes basically are to pick the popular place from the recommend places from all his friends.
Mady will consider each recommended place and if the same place recommended by some other place he will increase the count. In the end, the high count place where mady will go.
In this case, the recommended place (Target Prediction) is considered by many friends. Each friend is the tree and the combined all friends will form the forest. This forest is random forest. As each friend asked random questions to recommend the best place visit.
Now let’s use the above example to understand how the random forest algorithm work.
How Random forest algorithm works
Let’s look at the pseudocode for random forest algorithm and later we can walk through each step in the random forest algorithm.
The pseudocode for random forest algorithms can split into two stages.
- Random forest creation pseudocode.
- Pseudocode to perform prediction from the created random forest classifier.
First, let’s begin with random forest creation pseudocode
Random Forest pseudocode:
-
Randomly select “k” features from total “m” features.
- Where k << m
-
Among the “k” features, calculate the node “d” using the best split point.
-
Split the node into daughter nodes using the best split.
-
Repeat 1 to 3 steps until “l” number of nodes has been reached.
-
Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees.
The beginning of random forest algorithm starts with randomly selecting “k” features out of total “m” features. In the image, you can observe that we are randomly taking features and observations.
In the next stage, we are using the randomly selected “k” features to find the root node by using the best split approach.
In the next stage, We will be calculating the daughter nodes using the same best split approach. Will the first 3 stages until we form the tree with a root node and having the target as the leaf node.
Finally, we repeat 1 to 4 stages to create “n” randomly created trees. These randomly created trees form the random forest.
Random forest prediction pseudocode:
To perform prediction using the trained random forest algorithm uses the below pseudocode.
-
Takes the test features and use the rules of each randomly created decision tree to predict the oucome and stores the predicted outcome (target)
-
Calculate the votes for each predicted target.
-
Consider the high voted predicted target as the final prediction from the random forest algorithm.
To perform the prediction using the trained random forest algorithm we need to pass the test features through the rules of each randomly created trees. Suppose let’s say we formed 100 random decision trees to from the random forest.
Each random forest will predict a different targets (outcomes) for the same test feature. Then by considering each predicted target votes will be calculated. Suppose the 100 random decision trees are prediction some 3 unique targets x, y, z then the votes of x is nothing but out of 100 random decision tree how many trees prediction is x.
Likewise for the other 2 targets (y, z). If x is getting high votes. Let’s say out of 100 random decision tree 60 trees are predicting the target will be x. Then the final random forest returns the x as the predicted target.
This concept of voting is known as majority voting.
Now let’s look into a few applications of random forest algorithm.
Random forest algorithm applications
The random algorithm used in wide varieties of applications. In this article, we are going to address a few of them.
Below are some the application where the random forest algorithm is widely used.
- Banking
- Medicine
- Stock Market
- E-commerce
Let’s begin with the banking sector.
1.Banking:
In the banking sector, a random forest algorithm widely used in two main applications. These are for finding loyal customers and finding fraud customers.
The loyal customer means not the customer who pays well, but also the customer who can take the huge amount as loan and pays the loan interest properly to the bank. As the growth of the bank purely depends on loyal customers. The bank customer’s data highly analyzed to find the pattern for the loyal customer based on the customer details.
In the same way, there is a need to identify the customer who is not profitable for the bank, like taking the loan and paying the loan interest properly or find outlier customers. If the bank can identify theses kind of customer before giving the loan the customer. Bank will get a chance to not approve the loan to these kinds of customers. In this case, also random forest algorithm is used to identify the customers who are not profitable for the bank.
2.Medicine
In the medicine field, a random forest algorithm is used to identify the correct combination of the components to validate the medicine. Random forest algorithm also helpful for identifying the disease by analyzing the patient’s medical records.
3.Stock Market
In the stock market, a random forest algorithm used to identify the stock behavior as well as the expected loss or profit by purchasing the particular stock.
4.E-commerce
In e-commerce, the random forest used only in the small segment of the recommendation engine for identifying the likely hood of customers liking the recommend products base on similar kinds of customers.
Running a random forest algorithm on a very large dataset requires high-end GPU systems. If you are not having any GPU system. You can always run the machine learning models in cloud-hosted desktop. You can use clouddesktoponline platform to run high-end machine learning models from sitting any corner of the world.
Advantages of random forest algorithm
Below are the advantages of random forest algorithm compared with other classification algorithms.
- The overfitting problem will never come when we use the random forest algorithm in any classification problem.
- The same random forest algorithm can be used for both classification and regression tasks.
- The random forest algorithm can be used for feature engineering.
- This means identifying the most important features out of the available features from the training dataset.
Follow us:
FACEBOOK| QUORA |TWITTER| GOOGLE+ | LINKEDIN| REDDIT | FLIPBOARD | MEDIUM | GITHUB
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
hi i want random forest feature selection code in python and plot feature selection
One of the best blogs that i have read still now. Thanks for your contribution in sharing such a useful information. Waiting for your further updates.
Thanks Ajay Sharma 🙂
please explain how random forest identify this is the most important features out of the available features from the training dataset.
Hi Ayush Maheshwari,
To identify a feature is important in the random forest we can calculating different methods, like information gain, Gini index. Once we calculated these methods score for all available features, the model will pick the best score feature at each root node.
Thanks and happy learning!
i want to know specifically about decision tree& random forest nd also have some questions in mind
Hi Hafiza Iqra Naz,
You can check the below articles to learn more about the decision tree & random forest.
Links:
https://dataaspirant.com/how-decision-tree-algorithm-works/
https://dataaspirant.com/decision-tree-algorithm-python-with-scikit-learn/
https://dataaspirant.com/random-forest-algorithm-machine-learing/
https://dataaspirant.com/random-forest-classifier-python-scikit-learn/
Let me know if you need any other sources for the same topics.
Happy learning!
The word “excellent“ is not enough for this kind of explanation.
Hi Soumyabrata,
Thanks for the compliment.
We wish you a very happy learning.
Great real world example. Different decision trees with random sampling and accumulated total.
Hi Aaron Evans,
Thanks for the compliment.
We wish you a very happy learning.
Very Simplified. Created interest to dig deper into another concepts
Hi kiran,
Thanks for the compliment.
Happy learning.
Hi,
Thanks for the article.. Can you also explain what is out of bag error in random forest
Hi Surya,
Thanks for your compliment. When we are having very limited data for training the model, further it’s hard to split the data into train and test, So in such a case, it’s always best to use the bag of error method. In the bag of error case for the randomly created tree in the forest, we create a validation dataset. The data points in the validation dataset will never be included in training that particular tree. In scikit learn you can turn on the oob_score true to use this method.
Thanks and happy learning!
Full of typos & language errors and very hard to read… I though it is supposed to be proofread before publishing, no?
I still appreciate the material, but the way it is given is a bit disappointing.
Hi Boris,
Thanks a lot for the feedback. We have updated the article. Please have a look.
Thanks and happy learning!
Awesome explanation! It would be super cool, if you could write another explanation about gradient boosting and gradient boosting compared to random forests.
Hi Daniel,
Thanks for your suggestion, we will incorporate in the article.
Thanks and happy learning!
Great explanation. There are many grammatical errors. Please remove them to make the reading all the more enjoyable. Thanks.
Hi Anand,
Thanks for the suggestion we will try to minimize the grammatical errors.
We wish you a very happy learning.
I can´t really say if I liked the article cause I found it quite hard to understand, not due to the concepts but to the poor writing. It seems as though you wrote the words as they came to your mind, with no further revision.
I strongly recommend a rewriting.
Hi Bruce,
Thanks for your suggestion. Will do other revision now.
HI , I want to say that this post is awesome thanks for sharing this with us . Can you please tell something about python too
Hi Abhishek,
Thanks for the compliment. We are planning for the python tutorials series, will go live soon.
Happy learning.
You’re doing a brilliant work by writing out such posts, nobody can complain in understanding them.
However, I have a doubt when you mentioned selecting ‘k’ features out of ‘m’ features. Are the features not formed along columns? If so, then you have 5 features in every random tree.
Can you please explain?
Hi Ash,
Thanks for you compliment 🙂
Selecting ‘k’ features out of ‘m’ is basically selecting few columns for all the columns of the data we have. In the example I have taken 5 in all the random trees, In general this number will vary all the time.
Gotcha! Thanks for answering. So, how could we ensure that all the features are represented if they’re selected randomly?
That’s what random forest algorithm does.
Content is good, but not visible due to wrong colour scheme.
Hi Gunjan Dogra,
Thanks for your compliment 🙂 . We were glad for your suggestion, As for our brand, it’s hard for us to change the color scheme. Hope you understand.
Very Nice Bro.. Thanks for your articles. Learned a lot from those and by the way we are from same college 🙂
Hi Tarun Voora,
Thanks for your compliment 🙂
Thank you, this is one of this best site I ever visited, very good explanation easy to understand.
Just I want to know how random forest or decision tree create/set rules to get a prediction.
for example :
label features prediction
9 [9.0,3.0,1.2274] 8.753939659
6 [6.0,3.0,1.2274] 6.392596619
6 [6.0,2.84,0.89] 6.114090909
5 [5.0,3.0,1.2274] 5.902340209
8 [8.0,2.775,1.2593] 7.758650794
please c an you explain me how it will go from root node respective this example it will be more helpful.
Thank you in advance.
Hi Rahul,
Thanks a lot for your compliment 🙂
You can have a look at how to visualize the trained decision tree article. In this article, we have explained how the trained decision tree classifier uses the trained rules to make the prediction.
Hi, Saimadhu Polamuri,
Thank you, u really explained it well.
i would like to learn about How random forest classifier works for classification. Have you already published this article ?
Hi Calude,
Thanks for your compliment. We published an article explaining how to use the random forest classifier to perform classification task. Do have a look. https://dataaspirant.com/2017/06/26/random-forest-classifier-python-scikit-learn/
Thank you for your answer and I am sorry. I would like talking about how random forest works for the regression.
Thank
Hi Calude,
You can use the scikit learn RandomForestRegressor to model regression models. Will try to write an article on how to use the random forest algorithm for regression models.
Hi Saimadhu, U really explained it well….I just want to understand the backhand process of the decision tree. Where can I find that?
Hi Anagha Tamhankar,
Thanks for your compliment 🙂
We have written an article on how the decision tree algorithm works. You can have a look.
https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Well Explained concepts. Thanks for your time in writing the Article.
Hi Jisha,
Thanks for your compliment. 🙂
Random forest explained well. Great Job!
Hi Marcian Fernando,
Thanks for your compliment. 🙂
Hi, can you please tell me when random forest buulds tree , does it pick up only random attributes or also random tuples ?
Hi Mamta,
The random forest tree build considers random features and random observations to build each tree.
Thanks and happy learning!