How Gradient Boosting Algorithm Works

November 23, 2020 Shaik Sameeruddin

How Gradient Boosting Algorithm Works

Gradient boosting machines are a family of powerful boosting machine learning algorithms with various practical applications that have demonstrated tremendous success in the form of accuracy.

Which can be tailored to the application's unique needs since they learned about different loss functions in a better way.

The loss function is an indicator of the model’s stability when the underlying data of the model is updated. A rational interpretation of the loss function is what we strive to minimize the error.

For instance, if we try to predict sales prices with a regression. Then the loss function is based on a mistake between actual and expected domestic prices.

Likewise, the loss function will calculate how strong our predictive trend is in detecting bad credit defaults.

Gradient boosting is a naive algorithm that can easily bypass a training data collection. The regulatory methods that penalize different parts of the algorithm will benefit from increasing the algorithm's efficiency by minimizing over fitness.

In way it handles the model overfitting.

Learn how the gradient boosting algorithm works #machinelearning #datascience #classification #regression #gradientboosting

Click to Tweet

So eager to learn about gradient boosting algorithms ?

Great, this article offers a beginner’s tutorial on the gradient boosting approach. Descriptive examples and diagrams covering all the stages of gradient boosting model designs help in understanding the theoretical details.

We are also going to discuss the difficulty of the model in getting the optimized final model. Even though we are going to cover this in detail in the article. We just want to give an overview of the functionalities the algorithm performs to get the best-optimized model.

In other words, these functionalities help in reducing the difficulty and achieves the optimized model. Below are the listed functionalities.

An optimized loss function.
A bad indicator learner.
A adaptive model.

Along with the above mentioned functionalities, we are also going to study various gradient boosting applications.

Before we drive further. Let’s look at the list of concepts you will learn in this article.

Introduction to Boosting

Almost everyone in the field of machine learning will learn about the functionalities of gradient boosting. Many data scientists and analytical professionals regressively use this algorithm in their data science projects because of the predominant results it produces on various data science problems.

In addition, XGBoost is also the traditional algorithm for winning machine learning competitions on sites like kaggle, which is a variant of a gradient boosting machine.

In online competitions, XGBoost treat as the gold mine algorithm.

Also, boosting is an essential component of many of the recommended systems.

Now comes the real question.

What is boosting?

Let's address this question with more details.

What Is Boosting?

What is Boosting

Boosting is the ensemble learning method where we build multiple weak learners (same algorithms) in a parallel manner.

All these weak learners take the previous models’ feedback to improve their power in accurately predicting the miss classified classes. At the end, the algorithm uses all these weak learners to build the final model.

The final model predictions use the majority voting approach for classification problems. Whereas for regression kind of problem. The final value is the average value of all the week learning models.

If you know about how the randomforest algorithm works, this majority voting concept is same as that.

Now the other question you may get.

Can a strong model be accomplished from many models that are relatively poor and simply also called as weak learners?

In other words, can the multiple weak learners form a strong model to predict the future or test dataset?

By saying "weak models". We're not saying about simple basic models like decision trees.

But models with low-performance accuracy, where low is a bit better than random.

How can we build such models then?

A positive mathematical response to this question took a few years to create a fully-functional, algorithms based solution.

In short, this algorithm works in a few steps in a greedy approach.

At first, they construct a linear combination of simple models (basic algorithms) by re-weighting input data. The model (usually the decision tree) assigns larger weights for the incorrectly predicted items.

Going forward, the algorithm amplifies the incorrectly predicted classes and tries to predict them accurately. The loss function intends to minimize the errors for the incorrect classes than the overall dataset.

For the above explanation, you can refer to the below image in the AdaBoost section.

Adaboost

Many machine learning courses say AdaBoost, the ancestor of GBM (Gradient Boosting Machine).

However, after AdaBoost merged with gradient boosting machine. It has become evident that AdaBoost is just a different variant of GBM.

The algorithm itself has a very straightforward visual understanding and intuition to describe weights.

Let's take a look at the following toy classification problem where we're going to break the data between trees at depth 1 (also known as 'stumps') for each iteration of AdaBoost.

We have the following image for the first three iterations:

Increase weightage for the misclassified samples

For incorrect estimation, the size of the point is the weight assigned to it. We can see that weights are rising on each iteration – stumps cannot cope.

For creating the final model, we will combine all these weak learning to form the strong model. Below is the representation for the final model.

Adaboost Final Model Representation

Adaboost Pseudocode

Below is the pseudocode for the whole AdaBoost process.

Adaboost pseudocode

To understand it better sharing with you the visual representation of the weight changes.

Here in each iteration, we can see an increase in weights. Particularly at the border between classes. Once all the iterations are completed, We will get the best model that differences the target classes in a meaningful way.

Limitations with AdaBoost

AdaBoost works well but lacks clarification as to why the algorithm has successfully planted the seeds of doubt. Some thought it was a super-algorithm, a magic bullet, but others thought AdaBoost was overfitting.

The problem did indeed exist, mainly when the data had strong outliers. Therefore, in these kinds of issues, AdaBoost was unreliable.

Luckily, a few professors in the Stanford Department of Statistics, who developed Lasso, Elastic Net, and Random Forest, began studying the algorithm.

In 1999, Jerome Friedman came up with a generalization of boosting algorithms-Gradient Boosting (Machine), also known as GBM.

With this work, Friedman laid the statistical foundation for several algorithms that include a general approach to improving functional space optimization.

CART, bootstrap, and several other algorithms have come from Stanford's statistics department. In doing so, the Department has solidified its names in future textbooks.

These algorithms are very realistic, and some recent works have not yet widely adopted.

For example, check out the research paper Learning interactions through hierarchical group-lasso regularization.

Not many of Friedman's video recordings are available. While there is a fascinating interview with him about the development of CART and how they solved statistical problems similar to data analysis and data science today) more than 40 years ago.

There is also an excellent lecture from Hastie, a retrospective data analysis by one of the pioneers of approaches that we use every day.

The below image explains the various boosting algorithms cam fro this Stanford’s statistics department.

Boosting Algorithms Comparison

History of Gradient Boosting Machine (GBM)

It took more than ten years after the launch of the gradient boosting machine (GBM) to become an integral part of the data science toolbox.

Models like GLMboost and GAMboost reinforce current GAM, CoxBoost, and RankBoost and LambdaMART are available in the ranking.

There have been several gradients boosting machine (GBM) achievements on many platforms, including.

Stochastic gradient machine
Gradient Enhanced Decision Trees(GBDT)
Gradient Boosted Regression Trees (GBRT)
Multiple Additive Regression Trees (MART)

The ML population was also very divided and dissociated, making it hard to monitor the spread of development.

At the same time, boosting was actively used in the search ranking. For better results, the ranking approach rewritten in terms of a loss function that penalizes errors in the output order. So it has become easy to inject it into GBM.

AltaVista was one of the first businesses to implement a rating boost. Soon, ideas spread to Yahoo, Yandex, Bing, and so on. When this happened, boosting became one of the fundamental algorithms used in research and industry core technologies.

Machine learning competitions, particularly Kaggle, played a crucial role in boosting popularisation. Now, researchers had a shared forum where they could compete with a large number of participants from around the world on various data science issues.

With Kaggle, new algorithms could be evaluated on real data. Which offer "shine" opportunity algorithms, and provide full details on sharing model performance results across competition data sets.

After its appearance, the XGBoost Library quickly gained popularity. XGBoost is not a modern, special algorithm. It is an extremely successful implementation of classic GBM with additional heuristics.

Today, this algorithm has gone through a very typical machine learning algorithms journey: mathematical problems and algorithmic crafts to successful practical implementations and mass adoption years after its first appearance.

Mathematical understanding of Gradient Boosting Machine (GBM) Algorithm

Let's start with the issue we are about to solve with GBM.

In an overall supervised learning environment, we can solve the problem of function approximation.

We have a range of attributes x and goals, which is nothing but the target we are going to predict. {(xi, Yi)}i=1,…,n, we use to restore the y = f(x) dependency.

Where xi is the features and Yi is the respective target.

We restore dependency by approximating f^(x) and knowing the approximation is stronger by using the L(y, f) loss function.

To get the loss, we are going to pass both the f(x) output and the actual target value Y.

Which returns the loss of using this function (f(x)). Any supervised learning algorithm aims to reduce this loss value.

At the end, it’s all about getting a function to represent the input data. In simple words, we need to get the target value (Y) given the input (X) using the best function, which yields with less error/loss.

The below image showcase how various algorithms classifying the input data. You can observe how the same data points are classified differently with respect to the algorithm we considered.

Machine learning functions

By understanding the above image, it will help us a lot on solving any kind of problem of function approximation in general supervised learning.

Also, finding a better function’s which is suitable for solving any kind of problems and also it will help like a cheat sheet.

Right now, we make no assumptions about the model of our approximation for the form of dependence f^(x) or the distribution of the target variable (y).

The L(y,f) function is only supposed to be differentiable. This helps to understand how the loss function value changes with respect to the change in the algorithm’s parameters.

We express ourselves as follows to mitigate data loss:

The number of functions f(x) is unfortunately not only high, but their function space is infinite. This is why we can restrict the search area by a certain feature family f(x,θ),θ∈Rd. It is appropriate.

This dramatically simplifies the goal, as we now have a solvent parameter values optimization:

The parameters are typically approximated iteratively by simply analytical solutions to find optimum parameters θ^ often lack.

In the beginning, we join the empirical loss function Lθ(θ^), which helps us to assess our parameters with our knowledge. Let us also write down our approximation θ^ for many M iterations as a sum:

The only remaining thing is to find an effective iterative algorithm to reduceLθ(θ^). The simplest and most widely used option is the gradient descent as ∇Lθ(θ^).

Our final step is to start our first approach θi^ and to choose the number of iters M. Our final step is to start our first approach θ^ and to choose the number of iters M.

Let us discuss the steps for approximating this inefficient and naive algorithm to the θ^:

Gradient boosting pseudocode

Functional Gradient Descent

Imagine for a second that the function space will optimize and that we can look for approximations f^(x) as functions on an iterative basis.

As a description of incremental changes, we will express our approximation, each being a function. For convenience, the sum of the initial approximation f0^(x) is immediately commenced:

We just agreed that we are searching for our solution. Nothing has happened yet;

f^(x) not as a broad model with lots of parameters, but as a description of functions that pretend to pass into functional space. f^(x)

To complete this mission, we need to restrict the search to certain functions of the f^(x)=h(x, k). There are a few problems here — first, the sum of the models is complicated more than any family model; second, Note that we will have to pick an optimal − ρ∈ for step to the coefficient for every step; the problem is this:

This is where magic exists. All of our aims are described as having trained any kind of model h(x,θ) for any kind of loss function L(y,f(x,θ)) in general terms.

This is very complicated in practice, but luckily there is a convenient way for this task resolve.

We can measure its value on our data if we know the expression of the loss function gradient. Let's train the models so that our forecasts are similar to this gradient (with a lesser sign).

In other words, we use the least quadratic squares to correct these residual predictions. We minimize the square difference between pseudo-residuals r and our estimates for regression, and classification tasks.

You can relate, minimising the loss with square the R-squared and Adjusted R-squared.

The last issue for phase t is as follows:

Algorithms Pseudocode

How Gradient Boosting Works

Gradient Boosting Key Components

Gradient boosting approach can use for both the regression and classification problems. These teaching techniques generate a prediction model in the form of a series of weak prediction models, usually in the way of decision trees.

Three components are involved in gradient boosting:

An optimized loss function.
A lousy indicator learner.
An additive model that reduces failure cases.

Loss Function

The function of loss depends on the type of problem we are going to resolve.

It has to be differentiable. However, you should describe your own loss functions and help them.

Regression may use a squared error, for instance. In contrast, the classification may require a logarithmic loss.

The benefit of the gradient booster framework is that for each loss function, you may decide to use, the new booster algorithm does not need to extract.

Instead, the framework is general enough to enable any differentiable loss function.

Weak Learner

Decision trees are used in gradient boosting as weak learners.

Regression arborescences split values and add the output to them together to allow for the inclusion of results of the next models and to "right" the residuals in the predictions.

Trees designed greedily, selecting the best split points based on purity scores like Gini or minimizing loss.

Initially, as in AdaBoost's case. Very short decision trees were used, which had only one break, called the decision stump.

Larger trees can typically be uses 4-to-8 stages. In particular, the number of layers, nodes, splits, or leaf nodes reduces.

Additive Model

Trees are introduced one at a time. The current trees in the model are not updated. A gradient descent technique minimizes losses when adding trees.

Traditionally, gradient descent minimizes the number of parameters, such as the regression equation coefficients or the neural network weights.

After measuring the error or loss, the values are updated to minimize the error.

Instead of parameters, we have poor learning sub-models or, more precisely, decision trees. After calculating the loss, we have to add a tree to the model that decreases the loss (i.e., follows the gradient) to perform the gradient descent process.

This is achieved by parameterizing the tree. Changing the tree parameters and heading in the right direction (reducing the residual loss).

This method is commonly referred to as functional gradient descent or gradient descent with functions.

Implementing Gradient Boosting in Python

We learned about gradient boosting. Now comes the fun part, implementing these in python.

For implementation, we are going to build two gradient boosting models.

The first one is using the gradient boosting algorithm to solve the regression kind of problems. In contrast, the other one is for solving classification problems.

How to Use Gradient Boosting Classifier implementation

For grading problems in this section, we shall consider using Gradient boosting.

Creating classification dataset with make_classification

Second, we can construct a synthetic binary-classification problem with 1000 input examples and 20 features using make classification().

Next, using this dataset, we are going to build the boosted gradient algorithm.

Building Gradient Boosting Classifier

With repeated k-fold validation, we are going to test the model with three repetitions and 10 folds.

We report on all repeats and folds the mean and standard deviation from the accuracy of the formula.

Even we are having various measures for calculating performance of the models, In this case we have used only the accuracy.

The example indicates the mean and standard deviation of the model.

Note: The results can differ depending on the stochastic nature and precision of the algorithm or assessment process. Take the example several times and compare the average result.

In this case, the set Gradient Boosting with default hyperparameters achieves an accuracy of approximately 89.9% on the test dataset.

Performing prediction with a classification model

First, all the available data is assisted by the Gradient Boosting ensemble, and then predict() can be named to predict new data.

This is seen in our binary classification dataset by the example below.

The example is used for the whole dataset to predict a new row of data. The predicted class is 1. Likewise, you predict for the total test data also.

We learned how to implement the gradient boosting with sklearn.

Now Let's take a look at the implementation of regression using the gradient boosting algorithm.

Gradient Boosting Regressor implementation

In this section, we'll search for a regression problem by using Gradient Boosting.

Creating regression dataset with make_regression

First, we can use the make_regression() function to construct a 1000 examples, and 20 entry features synthetic regression problem.

Below is an example.

Next, using this dataset. We are going to build the boosted gradient algorithm model for the regression problem.

Building Gradient Boosting Regressor

We will test the model with three repeated k-fold cross-validation and 10 folds, as we have done with the last segment.

We report on all repeats and folds the mean absolute mistake (MAE) of the model. The science kit learning library negatively reduces MAE to a limit instead of a reduction.

These results are in better negative MAEs and in a perfect model with an MAE of 0. Below is the full example.

This example indicates the mean and standard deviation of the model.

Note: The results can differ depending on the stochastic nature and precision of the algorithm or assessment process. Take the example several times and compare the average result.

In this case, we can see the Default Hyperparameter Gradient Boosting ensemble achieves an MAE of approximately 62.

If we perform proper hyper parameter tuning then we would achieve the even better MAE.

Performing prediction with the regression model

As a final model, we can make predictions for regression using the Gradient Booster model.

First, all the data available is assisted by the Gradient Boosting ensemble, and then predict() can be named to predict new data.

We can see this is in our regression dataset in the example below.

This example is used for the whole dataset to predict a new row of data.

Complete Code

You can find the complete code explained in this article in our Github account. Feel free to have a look at all the dataaspirant codes in our Github Repo.

Conclusion

Today, we've discovered the principle behind gradient boosting algorithm. GBM is not just a particular algorithm but a standard technique for building model sets.

Besides, this approach is sufficiently versatile and expandable. A large number of models are trained, taking into account various loss functions with a range of weighting functions.

Gradient boosting models are efficient for both classification and regression algorithms, extremely complex data sets.

Gradient boosting models can do very well, but it is also susceptible to overfitting, which was compared with a variety of the above methods. Using the Scikit-Learn gradient boosters makes our job so easy.