How the Kaggle Winners' Algorithm XGBoost Works

November 16, 2020 Samuel Adebayo

The popularity of the XGBoost algorithm increased sharply with its performance in various Kaggle competitions. It has been a gold mine for Kaggle competition winners.

The Kaggle Avito challenge 1st place winner, Owen Zhang, said,

“When in doubt, just use XGBoost.”

Whereas the Liberty Mutual property challenge 1st place winner, Qingchen Wang, said,

“I only used XGBoost.”

The above two statements are enough to show the impact of the XGBoost algorithm in Kaggle.

With this popularity, people in data science and machine learning started using this algorithm more extensively than other classification and regression algorithms.

So XGBoost is part of every data scientist’s algorithm toolkit. If you are preparing for data science jobs, it’s worth learning this algorithm.

In this article, we are going to teach you everything you need to learn about the XGBoost algorithm. Before we dive further, let’s quickly have a look at the topics you are going to learn in this article.

Let’s begin with what exactly XGBoost means.

What is XGBoost?

XGBoost, short for "Extreme Gradient Boosting," is a supervised machine learning algorithm known for its speed and performance compared with other classification algorithms such as decision trees and random forests.

Tianqi Chen and Carlos Guestrin of the University of Washington are the original authors of XGBoost; Chen was a Ph.D. student there at the time and Guestrin a professor. They presented the XGBoost machine learning project at the SIGKDD Conference in 2016. Since then, it has received many more contributions from developers in different parts of the world.

After the presentation, many machine learning enthusiasts settled on the XGBoost algorithm as their first choice for machine learning projects, hackathons, and competitions.

XGBoost is a multifunctional open-source machine learning library with interfaces for a wide variety of languages, including

Python,
R,
Java,
Julia,
C++,
Scala.

It is known for its efficient execution, accuracy, and speed.

Tianqi Chen reported that the XGBoost algorithm can train several times faster than other machine learning classification and regression algorithms.

XGBoost is an ensemble learning method and an efficient implementation of the gradient boosted trees algorithm. One of the most important factors behind the success of XGBoost is its scalability in all scenarios.

The system runs many times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. With optimized memory usage, the algorithm also distributes its computation within the same framework.

This makes the algorithm learn faster. The scalability of XGBoost is the result of several important systems and algorithmic optimizations.

The next few paragraphs provide more detailed insight into the power and features behind the XGBoost machine learning algorithm.

Key Features of XGBoost Algorithm

The XGBoost (Extreme Gradient Boosting) algorithm is an open-source distributed gradient boosting framework. The significant advantage of this algorithm is the speed and memory usage optimization.

The objective of this library is to efficiently use the bulk of resources available to train the model. Here are some unique features behind how XGBoost works:

Speed and Performance: XGBoost is designed to be faster than other ensemble algorithms. Its core is written in C++, and it provides APIs for C++, Python, R, Java, Scala, and Julia.
Portability: The XGBoost algorithm runs on Windows, Linux, and OS X operating systems, and on cloud computing platforms such as AWS, GCE, and Azure.
Core Algorithm Parallelization: XGBoost harnesses the computational power of multi-core machines by parallelizing the core algorithm, which makes it practical to train sizeable models on large datasets.
Regularization: XGBoost penalizes model complexity and leaf weights through L1 and L2 regularization. Regularization helps prevent overfitting.
Missing-data friendly: XGBoost handles missing values natively. Its sparsity-aware split finding deals with the different kinds of sparsity in the data, such as missing entries and the zeros produced by one-hot encoding.
Weighted quantile sketch: Tree-based algorithms generally use quantile algorithms to propose candidate split points, but standard quantile algorithms assume every data point has equal weight and cannot handle weighted data. XGBoost's weighted quantile sketch handles weighted data.
Block structure for parallel learning: In XGBoost, data is arranged in in-memory units called blocks, so the data layout is computed once and reused rather than recomputed. This feature is useful for parallelizing tree construction.
Cache awareness: In XGBoost, non-contiguous memory access is needed to fetch the gradient statistics by row index. XGBoost is therefore designed to make good use of the hardware: it allocates internal buffers in each thread, where the gradient statistics can be stored.
Out-of-Core Computing: This feature optimizes the use of available disk space when handling huge datasets that do not fit into memory.
Hyperparameter Tuning: XGBoost also stands out when it comes to parameter tuning. It exposes tree parameters, regularization parameters, built-in cross-validation, missing-value handling, and more to improve the model's performance on the dataset.

XGBoost is an extension of gradient boosted trees. In the next section, let’s learn more about gradient boosted models, which helps in understanding the workflow of XGBoost.

Gradient Boosted Models (GBMs)

Gradient Boosted Models (GBMs) are ensembles of trees built sequentially, one after another. There are many boosting algorithms, for example, AdaBoost, Gradient Boosting, and XGBoost.

To understand how XGBoost works, we must first understand the gradient boosting and gradient descent techniques.

Gradient Descent Technique

Gradient boosting re-defines boosting as a mathematical optimization problem where the goal is to minimize the model's loss function by adding weak learners using gradient descent.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. As gradient boosting is based on minimizing a loss function, it leverages different types of loss functions.

This makes gradient boosting a flexible technique that can be used for both classification and regression.

Basically, gradient boosting builds its learners during the learning process (i.e., one tree is added at a time without modifying the existing trees in the model). The gradient descent optimization process determines the contribution of each weak learner to the ensemble.

The contribution of each tree is chosen to minimize the strong learner’s errors.

How XGBoost Works

The workflow of the XGBoost algorithm is similar to gradient boosting. If you are not familiar with how boosting ensembles work, please read the article on the difference between bagging and boosting ensemble learning methods.

This helps in understanding the XGBoost algorithm in a much broader way.

Gradient boosting does not change the sample distribution; instead, the weak learners train on the remaining residual errors of the strong learner. Training on the residuals of the model is another way to give more importance to misclassified data.

New weak learners are added to focus on the regions where the current learners perform poorly. Each weak learner's contribution to the final prediction is determined by a gradient optimization process that minimizes the strong learner's overall error.
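To make the idea of training on residuals concrete, here is a minimal pure-Python sketch of gradient boosting for squared-error regression. It is not how XGBoost is implemented: the weak learner is a one-split regression stump on a single feature, and the helper names (fit_stump, boost), the toy data, and the settings n_rounds=50 and learning_rate=0.1 are assumptions chosen for illustration.

```python
# Illustrative gradient boosting with squared error (not XGBoost's implementation).
# Each round fits a one-split stump to the current residuals and adds it to the model.

def fit_stump(xs, residuals):
    """Return the one-split stump (as a function) that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build an additive model: a constant plus a sum of shrunken stumps."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)          # earlier stumps are never modified
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data, roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8, 8.1]

mean_y = sum(ys) / len(ys)
baseline_error = mse(lambda x: mean_y, xs, ys)
boosted_error = mse(boost(xs, ys), xs, ys)
print(baseline_error, boosted_error)
```

The training error of the boosted model is lower than that of the constant baseline because every round moves the predictions a small step toward the residuals that remain.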

In gradient descent, a cost function measures how close the predicted values are to the corresponding actual values. Ideally, we want as little difference as possible between the predicted values and the actual values.

So we want the cost function to be minimized. The weights associated with a trained model are what make it predict values close to the actual values. The better the weights attached to the model, the more accurate the predicted values and the lower the cost function.

With more records in the training set, the weights are learned and then updated.

Gradient descent is an iterative optimization algorithm. It is a method for minimizing a function of several variables, so it can be used to minimize the cost function. It first runs the model with initial weights, and then seeks to minimize the cost function by updating the weights over several iterations.
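The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a one-feature linear model with mean squared error as the cost function; the function names, the learning rate of 0.05, and the toy data are assumptions, not part of XGBoost.

```python
# Gradient descent for a one-feature linear model y ~ w * x + b,
# minimising mean squared error (the cost function).

def cost(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=2000):
    w, b = 0.0, 0.0                   # initial weights
    n = len(xs)
    for _ in range(n_iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient to reduce the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the weights should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_cost = cost(0.0, 0.0, xs, ys)
w, b = gradient_descent(xs, ys)
final_cost = cost(w, b, xs, ys)
print(w, b, initial_cost, final_cost)
```

If the learning rate is set too high the updates overshoot and the cost grows instead of shrinking, which is why it is usually treated as a tuning parameter.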

A Loss Function To Be Optimized

The choice of loss function depends on the type of problem being solved, and it must be differentiable. Many standard loss functions are supported, and you can also define your own.

For instance, classification problems might use logarithmic loss, while regression problems may use squared error. These differences are well explained in the article on the difference between R-Squared and Adjusted R-Squared.
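As a rough illustration of these two losses, the sketch below writes each one for a single example together with its gradient with respect to the model output. The function names are assumptions made for this example, and the 0.5 factor on the squared error is a common convention that keeps the gradient simple.

```python
import math


def squared_error(y_true, y_pred):
    """Regression loss for one example."""
    return 0.5 * (y_pred - y_true) ** 2


def squared_error_gradient(y_true, y_pred):
    """Derivative of the squared error with respect to the prediction."""
    return y_pred - y_true


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(y_true, score):
    """Binary classification loss for one example; score is the raw model output."""
    p = sigmoid(score)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def log_loss_gradient(y_true, score):
    """Derivative of the log loss with respect to the raw score."""
    return sigmoid(score) - y_true


# Both gradients point in the direction that increases the loss,
# so a boosting step moves predictions the opposite way.
print(squared_error_gradient(3.0, 2.5))   # prediction too low: negative gradient
print(log_loss_gradient(1, 0.3))          # positive example under-scored: negative gradient
```

Because both functions are differentiable, the same boosting procedure can be used with either one by swapping in the corresponding gradient.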

An advantage of the gradient boosting technique is that a new boosting algorithm does not need to be derived for every loss function you might want to use.

Instead, the framework is generic enough that any differentiable loss function can be used.

A weak learner to make predictions

In gradient boosting, decision trees serve as the weak learners. Specifically, regression trees are used, which output real values at their leaves; this allows the outputs of successive models to be added together to “correct” the residuals in the predictions.

The trees are built greedily, selecting the best split points based on purity scores like Gini, or so as to minimize the loss. This is the same way Gini is calculated in decision tree algorithms.

In AdaBoost, very short decision trees were used: one-level trees called decision stumps, which split on a single attribute. It is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits, or leaf nodes. This ensures that the learners stay weak but can still be constructed greedily.
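A decision stump chosen greedily by Gini impurity can be sketched as follows. This is a simplified illustration on a single numeric feature; the function names and the six made-up data points are assumptions, and real tree learners search over every feature and use faster split-finding methods.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Greedy search over thresholds on one feature.

    Returns (threshold, weighted_gini) for the best split, or
    (None, gini(labels)) if no split improves on the unsplit node.
    """
    n = len(values)
    best_threshold, best_score = None, gini(labels)
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up measurements for two well-separated classes.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, score = best_split(petal_lengths, species)
print(threshold, score)  # 1.5 0.0
```

Here the split at 1.5 separates the two classes completely, so both sides are pure and the weighted Gini impurity drops to zero.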

An additive model to add weak learners to minimize the loss function

While trees are added in turns, the existing trees in the model do not change. A gradient descent technique is used to minimize the loss function when adding trees.

Traditionally, gradient descent is used to optimize a set of parameters, such as the coefficients in a regression equation or the weights in a neural network. After calculating the loss or error, the weights are updated to reduce that error.

In gradient boosting, instead of parameters, we have weak learner sub-models, namely decision trees. After calculating the loss, we perform the gradient descent procedure by adding a tree to the model that reduces the loss (i.e., follows the gradient).

We do this by parameterizing the tree, then modifying the tree's parameters so that it moves in the right direction (reducing the residual loss).

XGBoost was engineered to push the limits of computational resources for boosted trees. XGBoost is an implementation of GBM with significant upgrades. GBMs build trees sequentially, and so does XGBoost, but XGBoost parallelizes the construction of each tree.

This is one of the techniques that makes XGBoost faster. XGBoost also uses more accurate approximations by employing second-order gradients, together with advanced regularization (L1 and L2, as in lasso and ridge regression).
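For readers who want to see what "second-order gradients" and regularization look like in formulas, the block below summarizes the regularized objective in the notation of the 2016 XGBoost paper by Chen and Guestrin. The symbols are introduced here and do not appear elsewhere in this article: g_i and h_i are the first and second derivatives of the loss for sample i with respect to the current prediction, T is the number of leaves, w_j is the weight of leaf j, I_j is the set of samples falling in leaf j, and gamma and lambda are regularization parameters. The L1 (alpha) term is left out to keep the expressions short.

```latex
% Regularized objective: training loss plus a complexity penalty per tree
\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% Second-order approximation of the objective when adding tree f_t
\tilde{\mathcal{L}}^{(t)} = \sum_i \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)

% Optimal weight of leaf j, with G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% Gain of splitting a node into left (L) and right (R) children
\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
            - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

A larger lambda shrinks the leaf weights toward zero, and a larger gamma makes a split worthwhile only when it reduces the loss by more than gamma.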

XGBoost Summary

In short, XGBoost works on the concept of boosting, where the models are built sequentially. Each model takes the previous model’s feedback and focuses sharply on the errors made by the previous model.

Although the models are built one after another, the work inside each round, such as evaluating candidate splits, runs in parallel. This gives XGBoost the speed it needs.

When (NOT) To Use XGBoost

After learning so much about how XGBoost works, it is important to note that the algorithm is robust but works best under specific conditions.

Before selecting XGBoost for your next supervised machine learning project or competition, consider when you should and should not use it.

When To Use XGBoost

You have a large number of training samples. What counts as large varies; generally, a dataset with more than 1000 training samples and fewer than 100 features is considered fair.
In practice, if the number of features in the training set is smaller than the number of training samples, XGBoost should work fine.
You have a mixture of categorical and numeric features, or just numeric features, in the dataset.

When To Not Use XGBoost

The XGBoost algorithm does not perform well when the problem is not suited to its strengths.
More precisely, XGBoost is not a good fit for problems such as Natural Language Processing (NLP), computer vision, and image recognition and understanding.
These problems are best solved with deep learning techniques.
XGBoost should not be used when the training dataset is small. If the number of training samples is less than the number of features, XGBoost would not be effective.

Hyper-Parameter Tuning In XGBoost

Hyper-parameter tuning is an essential feature in the XGBoost algorithm for improving the accuracy of the model. There are three different categories of parameters according to the XGBoost documentation.

General parameters
Booster parameters
Task parameters

General Parameters

These parameters guide the functionality of the model.

Booster Parameters

There are two types of boosters.

Tree booster
Linear booster

The booster parameters used depend on the kind of booster selected. Tree boosters are used most often because they generally perform better than the linear booster.

Task Parameters

These parameters are used based on the type of problem. Generally, the parameters are tuned to define the optimization objective.

XGBoost sets default values for the booster and task parameters. There are many parameters under these three categories, each with a specific purpose. Some of the most commonly tuned parameters are

booster,
learning_rate,
gamma,
max_depth,
subsample,
colsample_bytree,
n_estimators,
tree_method,
lambda,
alpha,
objective.

Read the XGBoost documentation to learn more about the functions of the parameters.
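As a starting point, the parameters listed above could be collected in a plain Python dictionary like the one below. The values are illustrative assumptions rather than recommendations. The key names follow XGBoost's scikit-learn wrapper, where lambda and alpha are spelled reg_lambda and reg_alpha because lambda is a reserved word in Python.

```python
# Illustrative parameter values only; tune them for your own dataset.
params = {
    "booster": "gbtree",             # tree booster (the alternative is "gblinear")
    "learning_rate": 0.1,            # shrinkage applied to each new tree
    "gamma": 0.0,                    # minimum loss reduction required to split
    "max_depth": 4,                  # maximum depth of each tree
    "subsample": 0.8,                # fraction of rows sampled per tree
    "colsample_bytree": 0.8,         # fraction of columns sampled per tree
    "n_estimators": 200,             # number of boosting rounds
    "tree_method": "hist",           # histogram-based split finding
    "reg_lambda": 1.0,               # L2 regularization on leaf weights
    "reg_alpha": 0.0,                # L1 regularization on leaf weights
    "objective": "binary:logistic",  # task parameter: binary classification
}

for name, value in params.items():
    print(f"{name} = {value}")
```

With the xgboost package installed, a dictionary like this can be unpacked into the scikit-learn style estimator, for example XGBClassifier(**params).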

XGBoost implementation in Python

For clarity, the script is broken down into simple steps with easy-to-follow code. The datasets for this tutorial come from the scikit-learn datasets module.

Follow these next few steps and get started with XGBoost.

Installing XGBoost Package

Before we use the XGBoost package, we need to install it. There are two ways to install the package:

Installing in the anaconda environment
Installing in a python virtualenv environment

How to install XGBoost using Anaconda environment

Open the Anaconda prompt and type the below command.

conda install -c conda-forge py-xgboost

How to install XGBoost using python virtualenv environment

Inside your virtualenv, type the below command.

pip install xgboost

If you are not familiar with creating environments for data science projects, please read the article on how to create Anaconda and Python virtualenv environments.

That article addresses which environment is best for data science projects and when to use each.

How to Use XGBoost for Classification Problem

To learn how to implement the XGBoost algorithm for classification problems, we are going to use the famous Iris classification dataset from sklearn.

We build the XGBoost classification model in 6 steps.

Import the libraries/modules needed

We imported the required Python packages along with the XGBoost library.

Import data

We loaded the Iris dataset from the sklearn datasets module.

Data cleaning and preprocessing

We performed the basic data preprocessing on the loaded dataset.

Train-test split

We split the data into train and test datasets.
Then we used hyperparameter tuning to find the best parameters for the model.

XGBoost training and prediction

Using the best parameters, we built the classification model with the XGBoost package.

Model Evaluation

We evaluated the built classification model.

Now let’s learn how we can build a regression model with the XGBoost package.

How to Use XGBoost for Regression

To learn how to implement the XGBoost algorithm for regression problems, we are going to build a model with the famous Boston house price regression dataset from sklearn.

We build the XGBoost regression model in 6 steps.

Import the libraries/modules needed

We imported the required Python packages along with the XGBoost library.

Import data

We loaded the Boston house price dataset from the sklearn datasets module.

Data cleaning and preprocessing

We did not perform any data preprocessing on the loaded dataset; we only created the features and target datasets.

Train-test split

We split the data into train and test datasets.

XGBoost training and prediction

Using the default parameters, we built the regression model with the XGBoost package.

Model Evaluation

We evaluated the built regression model.

Note: We built these models in Google Colab, but you can use any integrated development environment (IDE) of your choice.

Complete code

You can get the complete code used in this article from the GitHub repo created for this article. To fork all the dataaspirant code, please use this link.

Links to both the classification and regression Colab notebooks are provided below.

Conclusion

This article has covered a quick overview of how XGBoost works.

The XGBoost algorithm is widely used among data scientists and machine learning experts because of its rich feature set, especially its speed and accuracy.

This article also covered an overview of tree boosting, a walkthrough of XGBoost in Python, and when to use the XGBoost algorithm.

XGBoost does not perform well on all types and sizes of data, because the mathematical model behind it is not engineered for every kind of problem.

For large problems beyond the reach of XGBoost, more sophisticated techniques such as deep learning are a better fit.