How CatBoost Algorithm Works In Machine Learning
The CatBoost algorithm was the first open-source machine learning algorithm developed in Russia. It was developed in 2017 by machine learning researchers and engineers at Yandex (a technology company).
It was intended to serve multi-functional purposes such as
- Recommendation systems,
- Personal assistants,
- Self-driving cars,
- Weather prediction, and many other tasks.
The CatBoost algorithm is another member of the family of gradient boosting techniques on decision trees.
One of the many unique features the CatBoost algorithm offers is its ability to work with diverse data types, which helps solve a wide range of data problems faced by numerous businesses.
Not just that, CatBoost also offers accuracy on par with the other algorithms in the tree family.
What is CatBoost Algorithm?
The name CatBoost comes from "Category" and "Boosting." Does this mean the "Category" in CatBoost implies it only works with categorical features?
The answer is, “No.”
According to the CatBoost documentation, CatBoost supports numerical, categorical, and text features, and it has a particularly good handling technique for categorical data.
The CatBoost algorithm has quite a number of parameters to tune the features in the processing stage.
"Boosting" in CatBoost refers to the gradient boosting machine learning. Gradient boosting is a machine learning technique for regression and classification problems.
Which produces a prediction model in an ensemble of weak prediction models, typically decision trees.
Gradient boosting is a robust machine learning algorithm that performs well when used to provide solutions to different types of business problems such as
- Fraud detection,
- Recommendation system,
- Forecasting.
Again, it can return an outstanding result with relatively little data, unlike other machine learning algorithms that only perform well after learning from extensive data.
We would suggest reading the article How the Gradient Boosting Algorithm Works if you want to learn more about how gradient boosting functions.
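To make the idea concrete, here is a minimal, from-scratch sketch of gradient boosting for regression under a squared-error loss: each shallow decision tree (the weak learner) is fit to the residuals of the current ensemble and added with a small learning rate. The synthetic data and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())  # start from the mean prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                     # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)      # a shallow, weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # stage-wise additive update
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```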
Features of CatBoost
Here we will look at the various features the CatBoost algorithm offers and why it stands out.
Robust
CatBoost can improve the performance of the model while reducing overfitting and the time spent on tuning.
CatBoost has several parameters to tune. Still, it reduces the need for extensive hyper-parameter tuning because the default parameters produce a great result.
Overfitting is a common problem in gradient boosting, especially when the dataset is small or noisy. CatBoost has several features that help reduce overfitting.
One of them is a novel boosting scheme called ordered boosting, which uses permutations of the training data to avoid the target leakage (prediction shift) that leads to overfitting. Another is built-in overfitting detection, which can stop training once the metric on a validation set stops improving.
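As an illustration of these overfitting controls, the hedged sketch below trains a CatBoost classifier on synthetic data with an L2 penalty on leaf values (l2_leaf_reg) and early stopping against a validation set; the specific parameter values are arbitrary.

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    l2_leaf_reg=5,      # L2 regularization on leaf values
    verbose=False,
)
model.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    early_stopping_rounds=50,  # stop once the validation metric stops improving
)
print("best iteration:", model.get_best_iteration())
```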
Automatic Handling of Missing Values
Missing values are a common problem in real-world datasets. Traditional gradient boosting frameworks require imputing missing values before training the model. CatBoost, however, can handle missing values automatically.
By default, missing values in numeric features are treated as a special value smaller than all other values, so the trees can learn splits that separate them from the rest of the data.
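For example, the minimal sketch below passes NaNs in numeric columns straight to CatBoost with no imputation step; the tiny made-up DataFrame is only for illustration.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

# Tiny made-up frame with missing numeric values
df = pd.DataFrame({
    "age":    [22, np.nan, 35, 58, np.nan, 41],
    "fare":   [7.25, 71.3, np.nan, 26.55, 8.05, 13.0],
    "target": [0, 1, 1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df[["age", "fare"]], df["target"])  # no imputation step needed
print(model.predict(df[["age", "fare"]]))
```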
Accuracy
The CatBoost algorithm is a high-performance, greedy, novel gradient boosting implementation.
Hence, CatBoost (when tuned well) either leads or ties in competitions and on standard benchmarks.
Categorical Features Support
Native support for categorical features is one of the significant reasons CatBoost stands out from other boosting algorithms such as LightGBM and XGBoost.
With other machine learning algorithms, after preprocessing and cleaning your data, the data has to be converted into numerical features so that the machine can understand and make predictions.
This is similar to text models, where the text data has to be converted into numerical form, for example with word embedding techniques.
This process of encoding or conversion is time-consuming. CatBoost supports working with non-numeric features directly, which saves time and can improve your training results.
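As a small illustration, the sketch below passes raw string categories straight to CatBoost through the cat_features argument, skipping manual encoding entirely; the column names and values are made up for the example.

```python
import pandas as pd
from catboost import CatBoostClassifier

# Made-up data: two categorical columns and one numeric column
df = pd.DataFrame({
    "city":   ["London", "Paris", "Paris", "Berlin", "London", "Berlin"],
    "device": ["mobile", "desktop", "mobile", "mobile", "desktop", "desktop"],
    "visits": [3, 10, 7, 1, 4, 8],
    "bought": [0, 1, 1, 0, 0, 1],
})

model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(
    df[["city", "device", "visits"]],
    df["bought"],
    cat_features=["city", "device"],  # tell CatBoost which columns are categorical
)
```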
Easy Implementation
CatBoost offers easy-to-use interfaces. The CatBoost algorithm can be used in Python with scikit-learn, R, and command-line interfaces.
Fast and scalable GPU version: the researchers and machine learning engineers designed CatBoost at Yandex to work on data sets as large as tens of thousands of objects without lagging.
Training your model on GPU gives a significant speedup compared to training on CPU.
Better still, the larger the dataset, the more significant the speedup. CatBoost also efficiently supports multi-card configurations, so for large datasets, use a multi-card configuration.
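A hedged sketch of requesting GPU and multi-card training through the Python API is shown below; task_type and devices are the relevant parameters, the card indices are placeholders, and a CUDA-capable GPU build of CatBoost is assumed.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    task_type="GPU",  # train on GPU instead of CPU
    devices="0:1",    # placeholder indices for a two-card configuration
    verbose=False,
)
# model.fit(X_train, y_train)  # fit as usual once data is prepared
```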
Faster Training & Predictions
A single server typically hosts at most 8 GPUs, and some data sets are larger than that, so CatBoost also supports distributed GPU training.
This feature enables CatBoost to learn faster and make predictions 13-16 times faster than other algorithms.
Interpretability
CatBoost provides some level of interpretability. It can output feature importance scores, which can help understand which features are most relevant for the prediction.
It also supports visualization of decision trees, which can help understand the structure of the model.
Supporting Community of Users
Having no team to contact when you encounter issues with a product can be very annoying. This is not the case for CatBoost.
CatBoost has a growing community where the developers look out for feedback and contributions.
There is a Slack community, a Telegram channel (with English and Russian versions), and Stack Overflow support. If you ever discover a bug, there is a page via GitHub for bug reports.
Is Tuning Required for the CatBoost Algorithm?
The answer is not straightforward, because it depends on the type and features of the dataset. The default parameter settings in CatBoost usually do a good job.
So CatBoost produces good results without extensive hyper-parameter tuning. However, some important parameters can be tuned to get a better result.
These parameters are easy to tune and are well explained in the CatBoost documentation. Here are some of the parameters that can be optimized for a better result (a tuning sketch follows the list):
- cat_features,
- one_hot_max_size,
- learning_rate & n_estimators,
- max_depth,
- subsample,
- colsample_bylevel,
- colsample_bytree,
- colsample_bynode,
- l2_leaf_reg,
- random_strength.
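Since CatBoostClassifier follows the scikit-learn estimator API, one simple way to tune a few of these parameters is a randomized search, as in the hedged sketch below; the search space and synthetic data are arbitrary.

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, only for illustration
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)

param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1],
    "depth": [4, 6, 8],            # CatBoost accepts depth (alias: max_depth)
    "l2_leaf_reg": [1, 3, 5, 9],
    "n_estimators": [200, 500],
}
search = RandomizedSearchCV(
    CatBoostClassifier(verbose=False),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```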
CatBoost vs. LightGBM vs. XGBoost Comparison
These three popular machine learning algorithms are all based on gradient boosting techniques; hence, they are greedy and very powerful.
Several Kagglers have won Kaggle competitions using one of these accuracy-focused algorithms.
Before we dive into their differences, it should be noted that the CatBoost algorithm does not require converting the data set to any specific format, in particular not to an all-numerical format, unlike XGBoost and LightGBM.
The oldest of these three algorithms is the XGBoost algorithm. It was introduced sometime in March 2014 by Tianqi Chen, and the model became famous in 2016.
Microsoft introduced LightGBM in January 2017, and Yandex then open-sourced the CatBoost algorithm in April 2017.
The algorithms differ from one another in how they implement the boosted trees algorithm and in their technical compatibilities and limitations.
XGBoost was the first to improve GBM's training time, followed by LightGBM and CatBoost, each with its own techniques, mostly related to the splitting mechanism.
Now we will go through a comparison of the three models using some characteristics.
Split
The split-finding routine is a core part of these algorithms, and the three of them find feature splits in different ways.
One sensible way to choose splits during the processing phase is to inspect the characteristics of each column.
LightGBM uses histogram-based split finding together with Gradient-based One-Side Sampling (GOSS), which reduces complexity by looking at the gradients. Instances with small gradients are already well trained (small training errors), while instances with large gradients are undertrained.
So, for GOSS to perform well and reduce complexity, LightGBM focuses on instances with large gradients, while a random sampling technique is applied to instances with small gradients.
The CatBoost algorithm introduced a unique system called Minimal Variance Sampling (MVS), which is a weighted sampling version of the widely used approach to regularization of boosting models, Stochastic Gradient Boosting.
Also, Minimal Variance Sampling (MVS) is the new default option for subsampling in CatBoost.
With this technique, the number of examples needed for each iteration of boosting decreases, and the quality of the model improves significantly compared to the other gradient boosting models.
The features for each boosting tree are sampled in a way that maximizes the accuracy of split scoring.
In contrast to the two algorithms discussed above, XGBoost does not utilize any weighted sampling techniques.
This is the reason why the splitting process is slower compared to the GOSS of LightGBM and MVS of CatBoost.
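As a rough illustration of where these sampling choices surface in the API, the sketch below selects Minimal Variance Sampling and a subsampling fraction when constructing a CatBoost model; the specific values are arbitrary.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=300,
    bootstrap_type="MVS",  # Minimal Variance Sampling
    subsample=0.8,         # fraction of objects sampled for each tree
    verbose=False,
)
# model.fit(X_train, y_train)  # fit as usual once data is prepared
```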
Leaf Growth
A significant difference in the implementation of gradient boosting algorithms such as XGBoost, LightGBM, and CatBoost is the method of tree construction, also called leaf growth.
The CatBoost algorithm grows a balanced tree. In the tree structure, the feature-split pair is performed to choose a leaf.
The split with the smallest penalty is selected for all the level's nodes according to the penalty function. This method is repeated level by level until the leaves match the depth of the tree.
By default, CatBoost uses symmetric (oblivious) trees, which are roughly ten times faster and give better quality than non-symmetric trees.
However, in some cases, other tree growing strategies (Lossguide, Depthwise) can provide better results than growing symmetric trees.
The parameters that change the tree growing policy include
- --grow-policy,
- --min-data-in-leaf,
- --max-leaves.
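A minimal sketch of the Python equivalents of these options (grow_policy, min_data_in_leaf, max_leaves) is shown below; the values are arbitrary, and max_leaves only applies with the Lossguide policy.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=300,
    grow_policy="Lossguide",  # leaf-wise growth instead of the default symmetric trees
    min_data_in_leaf=10,      # minimum number of training samples per leaf
    max_leaves=31,            # cap on the number of leaves per tree (Lossguide only)
    verbose=False,
)
# model.fit(X_train, y_train)  # fit as usual once data is prepared
```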
LightGBM grows trees leaf-wise (best-first). Leaf-wise growth finds the leaves that would most reduce the loss and splits only those, leaving the rest untouched, which can produce an imbalanced tree structure.
The leaf-wise strategy is an excellent way to achieve a lower loss because it does not grow level by level, but it often results in overfitting when the data set is small.
However, this strategy's greed with LightGBM can be regularized using these parameters
- --num_leaves,
- --min_data_in_leaf,
- --max_depth.
XGBoost, unlike LightGBM, grows trees level-wise (depth-wise) by default, splitting every node at the current depth before moving deeper; recent versions also offer a leaf-wise (lossguide) growing policy similar to LightGBM's.
In XGBoost, the main parameter that limits tree growth and helps reduce overfitting is
- --max_depth.
Missing Values Handling
CatBoost supports three modes for processing missing values:
- "Forbidden,"
- "Min,"
- "Max."
For "Forbidden,” CatBoost treats missing values as not supported.
The presence of the missing values is interpreted as errors. For "Min,” missing values are processed as the minimum value for a feature.
With this method, the split that separates missing values from all other values is considered when selecting splits.
"Max" works just the same as "Min,” but the difference is the change from minimum to maximum values.
The method of handling missing values for LightGBM and XGBoost is similar. The missing values will be allocated to the side that reduces the loss in each split.
Categorical Features Handling
CatBoost can handle categorical features with one-hot encoding: by default, it one-hot encodes categorical features that have a small number of different values in most modes, while higher-cardinality features are encoded with statistics computed from the categories and target values.
The number of categories for one-hot encoding can be controlled by the one_hot_max_size parameter in Python and R.
On the other hand, CatBoost's own categorical encoding can make training slower.
Even so, the CatBoost documentation advises against one-hot encoding the data yourself during preprocessing, because doing so affects both the model's speed and its quality; it is better to pass the raw categorical columns and let CatBoost handle them.
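The hedged sketch below shows where one_hot_max_size fits in; the threshold of 4 and the commented column names are arbitrary examples.

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=300,
    one_hot_max_size=4,  # one-hot encode categorical features with <= 4 unique values
    verbose=False,
)
# model.fit(X, y, cat_features=["city", "device"])  # hypothetical column names
```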
LightGBM uses integer-encoding for handling the categorical features. This method has been found to perform better than one-hot encoding.
The categorical features must be encoded to non-negative integers (an integer that is either positive or zero).
The parameter that controls the handling of categorical features in LightGBM is categorical_feature.
XGBoost was not engineered to handle categorical features. The algorithm supports only numerical features.
This, in turn, means that the encoding process would be done manually by the user.
Some manual methods of encoding include label encoding, mean encoding, and one-hot encoding.
When and When Not to Use CatBoost
We have discussed all of the strengths of the CatBoost algorithm without addressing the procedure for using it to achieve the best results.
In this section, we will look at when CatBoost is a good fit for our data, and when it is not.
When To Use CatBoost
Working on a small data set
Unlike some other machine learning algorithms, CatBoost performs well with a small data set. However, it is advisable to be mindful of overfitting; a small tweak to the parameters might be needed here.
When you are working on a categorical dataset
This is one of the significant strengths of the CatBoost algorithm. Suppose your data set has categorical features, and converting them to numerical format seems like quite a lot of work. In that case, you can capitalize on the strength of CatBoost to make the process of building your model easy.
Short training time on robust data
CatBoost is considerably faster than many other machine learning algorithms. The splitting, tree structure, and training process are optimized to be fast on both GPU and CPU. Training on GPU is reported to be up to 40 times faster than on CPU, roughly two times faster than LightGBM, and 20 times faster than XGBoost.
When To Not Use CatBoost
There are not many disadvantages to using CatBoost, whatever the data set.
So far, the main reason some practitioners do not consider CatBoost is the slight difficulty of tuning its parameters to optimize the model for categorical features.
Practical Implementation of CatBoost Algorithm in Python
CatBoost Algorithm Overview in Python 3.x
Pipeline:
- Import the libraries/modules needed
- Import data
- Data cleaning and preprocessing
- Train-test split
- CatBoost training and prediction
- Model Evaluation
Before we build the CatBoost model, let's have a look at the features of the dataset we will use (the classic Titanic dataset):
Feature | Description |
---|---|
pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
survival | Survival (0 = No; 1 = Yes) |
name | name of the passenger |
sex | sex of passenger |
age | age of the passenger |
sibsp | Number of Siblings/Spouses Aboard |
parch | Number of Parents/Children Aboard |
ticket | Ticket Number |
fare | Passenger Fare (British pound) |
cabin | Passenger Cabin |
embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
Before we implement CatBoost, we need to install the catboost library.
- Command: pip install catboost
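Below is a minimal sketch of the pipeline listed above applied to the Titanic data. It assumes a local titanic.csv file with columns named as in the Kaggle version of the dataset (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Survived); the file name, column capitalization, and the chosen feature subset are assumptions for illustration, not the original notebook.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Import data (assumes a local titanic.csv with Kaggle-style column names)
df = pd.read_csv("titanic.csv")

# 2. Light cleaning and preprocessing: pick a feature subset, fill missing categories
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
cat_features = ["Pclass", "Sex", "Embarked"]
df[cat_features] = df[cat_features].fillna("Unknown").astype(str)

X, y = df[features], df["Survived"]

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. CatBoost training and prediction (missing numeric values are handled automatically)
model = CatBoostClassifier(iterations=500, learning_rate=0.05, verbose=False)
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
y_pred = model.predict(X_test)

# 5. Model evaluation
print("accuracy:", accuracy_score(y_test, y_pred))
```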
You can get the complete code from our GitHub account. For your reference, the full IPython notebook is included there as well.
Conclusion
In this article, we have discussed and shed light on the CatBoost algorithm.
The CatBoost algorithm is excellent and widely adopted because of the features it offers, most especially its handling of categorical features.
This article covered an introduction to the CatBoost algorithm, its unique features, and the differences between CatBoost, LightGBM, and XGBoost.
We also covered whether hyper-parameter tuning is required for CatBoost and gave an introduction to using CatBoost in Python.
In summary, CatBoost is a powerful gradient boosting framework that can handle categorical features, missing values, and overfitting. It is fast, scalable, and provides some level of interpretability.
It has become a popular choice for many machine learning tasks and is used by companies such as Yandex, Uber, and Zillow.
Frequently Asked Questions (FAQs) On CatBoost Algorithm
1. What is CatBoost?
CatBoost (Categorical Boosting) is a high-performance, open-source gradient boosting library based on decision trees, with a special focus on handling categorical data.
2. How Does CatBoost Handle Categorical Features?
CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and their target values, without the need for explicit pre-processing like one-hot encoding.
3. What is Gradient Boosting and How is CatBoost Related to It?
Gradient Boosting is a machine learning technique that builds predictive models in a stage-wise fashion from an ensemble of weak learners, typically decision trees. CatBoost is a type of gradient boosting algorithm that introduces innovations in its approach, particularly for handling categorical data.
4. What Makes CatBoost Different from Other Gradient Boosting Libraries?
CatBoost has several distinguishing features including its handling of categorical data, symmetrical trees that help reduce overfitting, and an efficient and scalable implementation.
5. Can CatBoost Automatically Handle Missing Values?
Yes, CatBoost can automatically handle missing values in the dataset, often without a need for any special treatment like imputation.
6. What is Ordered Boosting in CatBoost?
Ordered Boosting is a technique used in CatBoost to combat overfitting, where it uses permutations to introduce randomness when constructing the trees.
7. Is CatBoost Faster than Other Boosting Algorithms?
CatBoost is designed to be competitive in speed with other state-of-the-art boosting algorithms, and its performance is highly dependent on the specific dataset and problem.
8. How Does CatBoost Achieve its High Performance?
CatBoost's performance comes from algorithmic enhancements such as ordered boosting, special handling of categorical features, and model shrinkage that help it to generalize better.
9. Can CatBoost be Used for Regression Problems?
Yes, CatBoost can be used for both classification and regression problems effectively.
10. What Programming Languages Does CatBoost Support?
The core of CatBoost is implemented in C++, and it provides interfaces for Python (with scikit-learn compatibility), R, Java, and the command line.
11. How Do You Tune CatBoost Parameters?
CatBoost parameters can be tuned using grid search, random search, or Bayesian optimization techniques, similar to tuning other machine learning models.
12. What are the Key Parameters in CatBoost?
Some of the key parameters include the learning rate, depth of the trees, the number of trees, and the L2 regularization coefficient, among others.
13. Does CatBoost Offer GPU Support?
Yes, CatBoost offers GPU support for faster model training, particularly beneficial for large datasets.
14. What Types of Problems is CatBoost Best Suited For?
CatBoost is particularly well-suited for datasets with a large number of categorical features and is applicable to a wide range of problems, from recommendation systems to predictive analytics in finance and healthcare.
15. How Scalable is CatBoost for Big Data?
CatBoost is designed with scalability in mind and can handle large datasets effectively, leveraging both multi-core CPU and GPU acceleration.