How Cross-Validation Works In Machine Learning

Cross Validation

Cross validation is defined as:

“A statistical method or a resampling procedure used to evaluate the skill of machine learning models on a limited data sample.”

It is mostly used while building machine learning models. It helps compare and select a model for a given predictive modeling problem and assess the models' predictive performance.

It then judges how the models perform on a new data set, also known as test data.

The motivation for using cross validation techniques is that when we fit a model, we fit it only to a single training dataset.


In this tutorial, along with cross validation in general, we will also focus on the k-fold cross-validation procedure for evaluating the performance of machine learning models.

By the end of this tutorial, you will be familiar with the topics below:

  • Various types of cross-validation among which k-fold cross-validation is the most commonly used
  • K-fold cross-validation is a resampling procedure that estimates the skill of the machine learning model on new data.
  • Some common strategies that we can use to select the value of k for our dataset
  • Common variations in cross-validation such as stratified and repeated that are available in scikit-learn.

Before we start learning, let's have a look at the topics you will learn in this article (only if you read the complete article 🙂).

To learn cross validation, you need to know about overfitting and underfitting. You don't need to know how to handle overfitting, but you should at least understand the issue.

Concept Of Model Underfitting & Overfitting

Underfitting and Overfitting

Whenever a statistical model or a machine learning algorithm captures the data’s noise, overfitting comes into play.

Intuitively, overfitting occurs when the machine learning algorithm or the model fits the data too well. Whenever overfitting occurs, the model gives a good performance and accuracy on the training data set but a low accuracy on new unseen data sets.

Contrary to that, whenever a statistical model or a machine learning algorithm cannot capture the data’s underlying trends, under-fitting comes into play.

Intuitively, under-fitting occurs when the model does not fit the data well enough. It can be said that under-fitting is a consequence of an overly simple model.

The term “simple” means the model does not adequately capture the structure underlying the data, for example when informative features are missing or when the features used contribute little to predicting the target variable.

How Can We Prevent Model Overfitting

In machine learning, a significant challenge with overfitting is that we are unaware of how our model will perform on the new data (test data) until we test it ourselves.

Generally, we split our initial dataset into two subsets, i.e., training and test subsets, to address this issue.

Underfitting Overfitting

If we find a smart way to turn the available initial dataset into multiple train and test datasets, we can reduce the issue of overfitting, as our model then learns on various training datasets.

That smart way is nothing but cross validation. This will make more sense when we explain how to create multiple train datasets in the upcoming sections of this article.

 Now comes the real question.

What Is Cross Validation?

Cross-validation is one of the best preventive measures against overfitting. It is a smart technique that allows us to utilize our data in a better way.

In data mining and machine learning models, separating the data into training and testing sets is an essential step. Using training and testing subsets drawn from the same data helps minimize data discrepancies and gives a better understanding of the machine learning model’s properties.

What if we could use the same data to create multiple training datasets?

That is where cross validation comes in.

Cross Validation Procedure

There are different types or variations of cross-validation, but the overall procedure remains the same.

The following is the procedure deployed in almost all types of cross-validation:

  •  Partition or divide the dataset into several subsets
  •  At a time, hold out one of the subsets and train the model on the remaining subsets
  •  Perform the model testing on the held-out subset

The same procedure is repeated for each subset of the dataset.

Cross Validation Procedure
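
To make this procedure concrete, here is a minimal, framework-free Python sketch. The "model" is just a hypothetical predict-the-training-mean regressor used for illustration; in practice you would plug in a real learning algorithm.

def k_fold_errors(targets, k):
    # partition the indices into k roughly equal, contiguous subsets
    n = len(targets)
    fold_size = n // k
    errors = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test_idx = list(range(start, stop))
        train_idx = [j for j in range(n) if j not in test_idx]
        # "train" on the remaining subsets: here, just the mean target value
        mean_target = sum(targets[j] for j in train_idx) / len(train_idx)
        # "test" on the held-out subset: mean squared error of that constant prediction
        mse = sum((targets[j] - mean_target) ** 2 for j in test_idx) / len(test_idx)
        errors.append(mse)
    return errors

print(k_fold_errors([3.1, 2.9, 3.4, 3.0, 2.8, 3.2], k=3))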

When building a machine learning model, the data is often split into training and validation/test sets. The purpose of the training set is to fit the model, and the purpose of the validation/test set is to validate/test it on new data that the model has never seen before.

We can do a classic 80-20% split, but different values such as 70%-30% or 90%-10% can also be used depending on the dataset’s size. 
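
As a quick sketch (not code from this article), scikit-learn's train_test_split can produce such a split; the toy regression data and the 0.2 test size below are arbitrary choices for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# toy data, only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# hold out 20% of the rows as the validation/test set (the classic 80%-20% split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)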

It is challenging to evaluate the model, and to make changes to it, without data that the model has not already seen. So we create two sections of our data, as listed below:

Training And Testing

  • Training Data
  • Test Data

Training Data

 Training data can be defined as:

        “The data used to construct the model.”

Most of our data should be used as training data, as it provides insight into the relationship between our inputs and the target output.

 Test Data

Test data can be defined as:

        “The data used for model validation/testing.”

The generated models are used to predict results that are unknown to them, and the data used for this purpose is named the test set.

We can use test data on our model to see how well our model performs on data it has never seen before. 

Depending upon the performance of our model on our test data, we can make adjustments to our model, such as mentioned below:

  •   Change the values of the hyperparameters, such as α and λ (also known as hyperparameter tuning)
  •   Adjust the number of variables in the model
  •   Change the number of layers in the neural network

Now we get a more refined definition of cross-validation, which is as follows:

“The process of using test data to assess the model’s performance is termed as cross-validation.”

Variations on Cross Validation

The commonly used variations on cross-validation are discussed below:

Train/Test Split

The train-test split evaluates the performance and the skill of the machine learning algorithms when they make predictions on the data not used for model training.

It is an easy and fast procedure to implement, and the results allow us to compare our algorithms’ performance on the predictive modeling problem. Though the method is simple and easy to use, it does not work well in some scenarios.

For example, the method does not work well with small datasets or in situations where additional configuration is needed.

Train And Test Datasets

Leave One Out Cross Validation (LOOCV)

This variation on cross-validation leaves one data point out of the training data. For instance, if there are n data points in the original data sample, then n-1 points are used to train the model, and the single remaining point is used as the validation set.

This cycle is repeated in all of the combinations where the original sample can be separated in such a way. After this, the mean of the error is taken for all trials to give overall effectiveness. 

We consider that the number of possible combinations is equal to the number of data points in the original sample represented by n.
LOOCV Leave One Out Cross Validation
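
As a sketch of how LOOCV looks in code, scikit-learn provides a LeaveOneOut class; the six sample values below are assumed purely for illustration:

from numpy import array
from sklearn.model_selection import LeaveOneOut

# a toy sample of n = 6 observations (values assumed for illustration)
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

loocv = LeaveOneOut()
for train_idx, test_idx in loocv.split(data):
    # n - 1 points for training, a single held-out point for validation
    print('train: %s, test: %s' % (data[train_idx], data[test_idx]))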

Stratified

The process of rearranging the data to ensure that each fold is a good representative of the whole is termed stratification.

For instance, in the case of a binary classification problem, each class may comprise 50% of the data.

Let's say the class ratio is a 30% and 70% distribution. The best practice is then to arrange the data so that every fold preserves the same 30% and 70% distribution of the classes.

Note that a 30%/70% ratio does not count as imbalanced data.

Stratified K-Fold Cross Validation
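
A minimal sketch with scikit-learn's StratifiedKFold class, assuming a toy binary target with a 30%/70% class distribution:

from numpy import array
from sklearn.model_selection import StratifiedKFold

# toy features and labels with a 30%/70% class distribution (assumed for illustration)
X = array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    # each fold keeps (approximately) the same 30%/70% class ratio
    print('test labels:', y[test_idx])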

Repeated

 In this method, the k-fold cross-validation method undergoes n number of repetitions.

More importantly, the data sample’s shuffling is done before each repetition, resulting in a different sample split.
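
A minimal sketch using scikit-learn's RepeatedKFold class, with an assumed toy data sample; here a 3-fold split is repeated twice, with a fresh shuffle before each repetition:

from numpy import array
from sklearn.model_selection import RepeatedKFold

# toy data sample (values assumed for illustration)
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# 3 folds, repeated 2 times, shuffled differently on each repetition
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=1)
for train_idx, test_idx in rkf.split(data):
    print('train: %s, test: %s' % (data[train_idx], data[test_idx]))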

Nested

In this method, k-fold cross-validation is performed within each fold of an outer cross-validation, often to tune the hyperparameters during the evaluation of the machine learning model.

This process is termed nested or double cross-validation.
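
A minimal sketch of nested cross-validation with scikit-learn; the classifier (SVC) and the parameter grid are arbitrary illustrations. The inner loop (GridSearchCV) tunes the hyperparameters, while the outer loop (cross_val_score) evaluates the tuned model:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# toy classification data, only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # model evaluation

search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print('nested CV accuracy: %.3f' % scores.mean())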

Train – Test Split

This method usually splits our data into an 80:20 ratio between the training and test data.

The technique works well enough when the amount of data is large, say when we have 1000+ rows of data.

1. When we are working with 100,000+ rows of data, a ratio of 90:10 can be used, and with 1,000,000+ rows, we can use a 99:1 split.

2. Generally, when working with a large amount of data, a smaller percentage of test data can be used, since the amount of training data is sufficient to build a reasonably accurate model.

Cost Function

Cost function

Let’s have a look at the cost function, or mean squared error, of our test data. Cost functions like this help us quantify and minimize the errors the model makes.
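
One standard way to write the test-set mean squared error is:

MSE_test = (1 / m_test) * Σ ( h(x_test^(i)) - y_test^(i) )², summed over i = 1 to m_test

where h(x_test^(i)) is the model's prediction for the i-th test example and y_test^(i) is its true value.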

In the above formula, m_test is the number of examples in the test data.

To assess the performance of our model and make adjustments accordingly, several evaluation metrics are available, such as mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).

The metrics mentioned above are for regression problems.
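
For instance, here is a quick sketch using scikit-learn's metrics module on hypothetical test targets and predictions:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]  # hypothetical test-set target values
y_pred = [2.5, 0.0, 2.0, 8.0]   # hypothetical model predictions

print('MSE:', mean_squared_error(y_true, y_pred))
print('MAE:', mean_absolute_error(y_true, y_pred))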

Below are the advantages and disadvantages of the Train – Test Split method.

Advantages

  • Train – Test Split works well with large data sets.
  • The computing power is low.
  • The feedback for model performance can be obtained quickly.

Disadvantages

  • Train – Test Split works very poorly on small data sets.
  • It often leads to the development of the models having high bias when working on small data sets.
  • There is a possibility of selecting test data with similar values, i.e., non-random values, resulting in an inaccurate evaluation of model performance.

K-Fold Cross Validation

In k-fold cross-validation, we do more than one split. We can do 3, 5, 10, or any K number of partitions.

These splits are called folds, and the method works well by splitting the data into folds, usually consisting of around 10-20% of the data. 

Upon each iteration, we use different training folds to construct our model; therefore, the parameters which are produced in each model may differ slightly.

The model parameters generated in each case are also averaged to make a final model. 

The k-fold procedure has a single parameter termed k, which depicts the number of groups the sample data can be split into.

For example, if we set the value k=5, the dataset will be divided into five equal parts. 

Following the general cross-validation procedure, the process will run five times, each time with a different holdout set.

K-Fold Cross Validation

Below are the advantages and disadvantages of k-fold cross-validation.

Advantages

  • K-fold cross-validation works well on small and large data sets.
  • All of our data is used in testing our model, thus giving a fair, well-rounded evaluation metric.
  • K-fold cross-validation may lead to more accurate models since we eventually utilize all of our data to build the model.

Disadvantages

  • The computing power is high.
  • So it may take some time to get feedback on the model’s performance in the case of large data sets.
  • Slower feedback makes it take longer to find the optimal hyperparameters for the model.

Configuration Of K in Cross Validation

Now, let’s discuss how we can select the value of k for our data sample.

It must be noted that the value of k must be chosen carefully because a poorly chosen value for k may give a vague idea of the machine learning model’s skill.

Common tactics for choosing the value of k

The common strategies for choosing a value for k are as follows:

Representative:

We choose the value of k so that each train/test subset of the data sample is large enough to be statistically representative of the broader dataset.

Using the value k=10: 

In the field of applied machine learning, the most common value of k found through experimentation is k = 10, which generally results in a model skill estimate with low bias and a moderate variance.

Putting k=n:

In this strategy, the value of k is fixed to n, where n represents the dataset’s size, so that every individual sample gets a chance to be used as the holdout test set. This approach is called leave-one-out cross-validation (LOOCV).

Deductions

Some essential deductions from the above strategies are as follows:

  1. We usually use a value of k of either 5 or 10, but there is no hard and fast rule. When we use a larger value of k, the difference between the training and the resampling subsets gets smaller. The bias gets smaller as the difference decreases.

  2. Most commonly, the value of k=10 is used in the field of applied machine learning. A bias-variance tradeoff exists with the choice of k in k-fold cross-validation. Given this scenario, k-fold cross-validation can be performed using either k = 5 or k = 10, as these two values do not suffer from high bias and high variance. 

  3. When we choose a value of k that does not split the data evenly, the remainder of the examples will end up in one group. We prefer to split our data sample into k groups having the same number of samples.

Example Depicting The K-Fold Procedure

To get a concrete picture of k-fold cross-validation, let's have a look at the following example depicting its procedure.

We consider that we have 6 observations as below:

K-Fold Cross Validation Example

Initially, the value of k is chosen to determine the number of folds required for splitting the data; here we will use a value of k=3.

That means we will first shuffle the data and then split it into three groups. As we have six observations, each group will have an equal number of 2 observations.

The data sample can then be used to evaluate the machine learning model’s skill and performance.

Three models are trained and evaluated, with each fold given a turn as the held-out test set.

For example:

  • Model1: Training on fold1 and fold2, Testing on fold3
  • Model2: Training on fold2 and fold3, Testing on fold1
  • Model3: Training on fold1 and fold3, Testing on fold2

After the evaluation process ends, the models are discarded as their purpose has been served. The skill scores are then collected for each model and summarized for use.

Cross Validation API

The k-fold cross-validation process need not be implemented manually. The scikit-learn library provides an implementation that performs the splitting of the given data sample.

We can use the KFold() scikit-learn class. It takes as arguments the number of splits, whether or not to shuffle the data sample, and the seed for the pseudorandom number generator used before the shuffle.

For example, we can create an instance that splits the dataset into three folds, shuffles the data sample before the split, and uses a value of 1 to seed the pseudorandom number generator:

kfold = KFold(n_splits=3, shuffle=True, random_state=1)

We can call the split() function on the KFold instance, providing the data sample as an argument. Iterating over the result yields each group of train and test indices in turn.

In particular, it returns arrays containing the indices into the original data sample of the observations to use as the train and test sets on each iteration.

For example, the splits of the indices for the data sample can be enumerated using the created KFold instance, as shown in the code below.

All of this can be tied together with the small dataset mentioned above in the worked example.
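
A sketch of that worked example, assuming six observations with values 0.1 through 0.6:

# k-fold cross-validation split with scikit-learn (sketch)
from numpy import array
from sklearn.model_selection import KFold

# the small data sample from the worked example (values assumed for illustration)
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# 3 folds, shuffled, with a random seed of 1
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# enumerate the splits; split() returns index arrays into the original data sample
for train_idx, test_idx in kfold.split(data):
    print('train: %s, test: %s' % (data[train_idx], data[test_idx]))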

When we run the above example, specific observations chosen for each train and test set are printed.

On the original data array, the indices are used directly to retrieve the observation values.

Sklearn Cross Validation Output

In the scikit-learn library, the k-fold cross-validation implementation is also provided as a component operation within broader methods, such as scoring a model on a given data sample.

However, we can also use the KFold class directly to split up the dataset before modeling, so that all of the machine learning models use the same splits of the data. This technique is mostly helpful when we are working with large datasets.

Using the same partitions of data across algorithms can have a lot of benefits for statistical tests.
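
As a sketch of that usage pattern (the estimators and data below are arbitrary illustrations), a single KFold instance can be passed as the cv argument so that every model is scored on exactly the same splits:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# toy classification data, only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# one shared splitter, so both algorithms are evaluated on identical folds
cv = KFold(n_splits=10, shuffle=True, random_state=1)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=1)):
    scores = cross_val_score(model, X, y, cv=cv)
    print(type(model).__name__, 'mean accuracy: %.3f' % scores.mean())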

You can clone the code related to this article from our Github repo.

Conclusion

So far, we have learned that cross-validation is a powerful tool and a strong preventive measure against model overfitting.

With cross validation, we can make better use of our data and gain a much better understanding of our algorithm’s performance. Furthermore, we had a look at variations of cross-validation like LOOCV, stratified, k-fold, and so on.

In complicated machine learning pipelines, it sometimes becomes a bit too easy to stop paying attention and use the same sample data in different pipeline stages.

The consequence is that the model may show good but not realistic performance, as strange side effects may be introduced.

Cross-validation can be of great use while dealing with the non-trivial challenges in the Data Science projects.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
