How to Handle Overfitting With Regularization


Overfitting and regularization are among the most common terms you will hear in machine learning and statistics. A model is said to be overfitting if it performs very well on the training data but fails to perform well on unseen data.

This is one of the most common and damaging problems that occurs when training machine learning models. There are many techniques you can use to fix it, and regularization is one of them.

This post will help you understand what regularization is and how it helps fix the overfitting problem.

Are you excited to learn? Great, let's get started.


Understanding Overfitting in Machine Learning


Overfitting occurs when the model tries to learn the training data too well. In other words, the model attempts to memorize the training dataset, which leads to capturing the noise in the data.

Learning data points that are present only by random chance, and that don't represent the true properties of the data, makes the model overly flexible.

Hence, the model performs very well on the training data but fails to perform well on unseen data.

Overfitting is like training your dog to lie down when you whistle: after some practice, it learns the trick perfectly. But it lies down only for your whistle and not anyone else's, because it was trained on your whistle alone.

The same thing happens to your model when it is not trained correctly or when there is not enough variation in the training data.

The model only works perfectly for the training dataset but fails to generalize for unseen data.

How to identify Overfitting?

Some of the signals to look out for to identify whether the model is overfitting are:

  • Train and test loss
  • Prediction graph

Let’s understand these in detail.

Train and Test loss 

Loss is a number that indicates how bad a single prediction was: the worse the prediction, the higher the loss. In general, it is helpful to split the dataset into three parts:

  • Training, 
  • Validation, 
  • Testing. 

Validation is for checking the model’s performance before testing it on a completely new dataset. It is important to calculate losses during both training and validation.

If the training loss is very low, but the validation loss is very high, then your model is overfitting.
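As a minimal sketch of this check, here is what comparing training and validation loss could look like with scikit-learn (the synthetic dataset and the plain linear model are just placeholders for your own data and model):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own feature matrix X and target y.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_loss = mean_squared_error(y_train, model.predict(X_train))
val_loss = mean_squared_error(y_val, model.predict(X_val))

# A validation loss far above the training loss is a red flag for overfitting.
print(f"Train MSE: {train_loss:.2f}, Validation MSE: {val_loss:.2f}")
```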

There is another way of framing the same idea: the bias-variance tradeoff.

Prediction Graph

If you plot the data points and the fitted curve on a graph, and the curve looks too complex, fitting every data point perfectly, then your model is overfitting.
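A quick way to see this is to deliberately overfit a noisy curve; the sketch below (using NumPy and matplotlib, with made-up data) fits a high-degree polynomial that threads through every point:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy data

# A degree-12 polynomial is far more flexible than 15 noisy points warrant.
coeffs = np.polyfit(x, y, deg=12)
x_fine = np.linspace(0, 1, 200)

plt.scatter(x, y, label="data points")
plt.plot(x_fine, np.polyval(coeffs, x_fine), label="degree-12 fit (overfit)")
plt.legend()
plt.show()
```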

Techniques to fix Overfitting

There are many techniques you can use to prevent your model from overfitting.

Train with more data

This is not always possible, but if the model is too complicated relative to the amount of training data available, it is better to increase the size of the training data.

Don’t train with highly complex models

If you are training a very complex model on relatively simple data, the chances of overfitting are very high. Hence, it’s better to reduce the model’s complexity to prevent overfitting.

Cross-validation

Cross-validation is a powerful method to prevent overfitting. The idea of cross-validation is to divide our training dataset into multiple mini train-test splits. Each split is called a fold.

We divide the training set into k folds; the model is iteratively trained on k−1 folds, and the remaining fold is used as a test fold. Based on the model’s performance on the test fold, we tune its hyperparameters.

This allows us to tune the hyperparameters with only our original training dataset, without any involvement of the original test dataset.
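As a sketch, assuming scikit-learn, 5-fold cross-validation for picking a hyperparameter could look like this (the Ridge model and the candidate alpha values are just examples):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

# Score each candidate hyperparameter on k = 5 folds; each fold takes a turn
# as the held-out test fold while the other 4 are used for training.
for alpha in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha}: mean R^2 over 5 folds = {scores.mean():.3f}")
```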

You can learn more about cross-validation in our dedicated article.

Remove unnecessary features

If there are too many features, then the model tends to overfit. So, it is better to remove unnecessary features from the dataset before training.
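One simple way to do this (a sketch, not the only option) is univariate feature selection with scikit-learn's SelectKBest, which keeps only the k features most related to the target:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 30 features, only 10 of which actually carry signal.
X, y = make_regression(n_samples=200, n_features=30, n_informative=10,
                       noise=10.0, random_state=42)

# Keep the 10 features with the strongest univariate relationship to y.
selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (200, 30) -> (200, 10)
```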

Regularization

As the word suggests, regularization is the process of regularizing the model’s parameters, which discourages learning an overly complex model.

What is Regularization?


Before we discuss regularization, let's think about the below questions.

  • What if there is less data for training?
  • How do we avoid overfitting?

Let’s assume the data has too many features: how do we know which features to remove and which to keep?

In any machine learning problem, the accuracy of a model is a very important factor. In binary classification models, we check the performance of the model with a confusion matrix.

For other models, we have various other evaluation metrics.

Sometimes we can’t squeeze more accuracy out of a model without adding more features, but adding too many features can result in overfitting.

Then how do we avoid overfitting?

This is where regularization comes in handy.

Regularization, as the name suggests, is the process of regularizing something. Regularization shrinks the parameters of the model toward zero, which reduces its freedom.

Hence, the model will be less likely to fit the noise of training data and will improve the generalization ability of the model.

We penalize the cost function by adding a penalty term that regularizes, or shrinks, the coefficient estimates toward zero.

Let’s look at the (unregularized) cost function:

J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

where

  • hθ(x(i)) is the predicted value for data point x(i)
  • y(i) is the actual target value
  • m is the number of training examples

The penalized cost function looks like this:

J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))² + λ Σ θj²

(shown here with the squared penalty; the absolute-value variant is covered below)

where

  • λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. It can be tuned using cross-validation.
  • θj are the model’s coefficients.

So each time some parameter tries to become large, it gets penalized back toward a small value.

 There are two kinds of regularization:

L1 Regularization

This adds a penalty equal to the L1 norm of the weight vector (the sum of the absolute values of the coefficients). It will shrink some parameters exactly to zero.

Hence, some variables will not play any role in the model, and L1 regularization can be seen as a way to select features in a model.

L1 = L(X, y) + λ Σ |θj|

L2 Regularization

This adds a penalty equal to the squared L2 norm of the weight vector (the sum of the squared values of the coefficients). It will force the parameters to be relatively small.

L2 = L(X, y) + λ Σ θj²
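To make the two penalties concrete, here is a small NumPy sketch (with made-up numbers for the base loss and coefficients) that evaluates both penalized losses:

```python
import numpy as np

def l1_penalized_loss(base_loss, theta, lam):
    # L1: add lambda times the sum of absolute coefficient values.
    return base_loss + lam * np.sum(np.abs(theta))

def l2_penalized_loss(base_loss, theta, lam):
    # L2: add lambda times the sum of squared coefficient values.
    return base_loss + lam * np.sum(theta ** 2)

theta = np.array([0.5, -2.0, 3.0])  # example coefficients
base_loss = 1.2                     # example value of L(X, y)
lam = 0.1

print(l1_penalized_loss(base_loss, theta, lam))  # 1.2 + 0.1 * 5.5   = 1.75
print(l2_penalized_loss(base_loss, theta, lam))  # 1.2 + 0.1 * 13.25 = 2.525
```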

Ridge and Lasso Regression

Two of the very powerful techniques that use the concept of L1 and L2 regularization are Lasso regression and Ridge regression.

These models are extremely helpful in the presence of a large number of features in the dataset.

Lasso Regression

How Lasso Regression Works

  • Lasso regression is like linear regression, but it uses L1 regularization to penalize the cost function.
  • So, the minimized cost function is the original cost function plus a penalty equivalent to the sum of the absolute values of the coefficients’ magnitudes.

Ridge Regression

  • When a regression model uses L2 regularization, it is termed Ridge regression.

  • The minimized cost function is the original cost function with some penalty equivalent to the sum of the squares of the magnitude of the coefficients.


Take a look at our articles where we explain Lasso and Ridge regression in detail.

Regularization implementation in python

Now let’s implement regularization in Python. We are going to use a House Sales dataset. First, let’s import some necessary libraries and clean the dataset.
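The original notebook isn't reproduced here; as a hedged sketch, assuming the dataset is a CSV with a price column (the file name and column names below are placeholders you would adjust):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "house_sales.csv", "id", "date", and "price" are assumed names; adjust
# them to match the actual House Sales dataset you downloaded.
df = pd.read_csv("house_sales.csv")

# Basic cleaning: drop identifier/date columns and rows with missing values.
df = df.drop(columns=["id", "date"], errors="ignore").dropna()

X = df.drop(columns=["price"])  # features
y = df["price"]                 # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```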

Now, we’ll check how well different regression models are working.

Linear Regression Implementation

We’ll be using the scikit-learn library for implementing our algorithms. To understand the library better, you can check out the official scikit-learn documentation.
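A plain linear regression baseline, continuing from the train/test split above, might look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Compare train vs. test scores: a big gap would suggest overfitting.
print("Train R^2:", r2_score(y_train, lin_reg.predict(X_train)))
print("Test R^2: ", r2_score(y_test, lin_reg.predict(X_test)))
```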

Lasso Regression (L1) Implementation

We’ll use cross-validation to tune the hyperparameter “alpha”, as sketched below.
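One way to do this (a sketch, assuming the same X_train/y_train as above) is scikit-learn's LassoCV, which selects alpha by k-fold cross-validation:

```python
from sklearn.linear_model import LassoCV

# LassoCV tries each candidate alpha with 5-fold cross-validation.
lasso = LassoCV(alphas=[0.01, 0.1, 0.25, 0.5, 1.0], cv=5)
lasso.fit(X_train, y_train)

print("Best alpha:", lasso.alpha_)
print("Test R^2: ", lasso.score(X_test, y_test))
```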

Ridge Regression (L2) Implementation
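Similarly, a hedged sketch using RidgeCV to tune alpha over the same candidate values:

```python
from sklearn.linear_model import RidgeCV

# RidgeCV evaluates each candidate alpha with cross-validation.
ridge = RidgeCV(alphas=[0.01, 0.1, 0.25, 0.5, 1.0], cv=5)
ridge.fit(X_train, y_train)

print("Best alpha:", ridge.alpha_)
print("Test R^2: ", ridge.score(X_test, y_test))
```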

From the above, we can conclude that Ridge regression with alpha value 0.25 best fits the data. So, we use the ridge regressor with the selected value as the final model.

In this way, we could implement regularization with linear regression models.

Conclusion

In this blog post, we tried to understand what overfitting is and how to identify it. Overfitting is simply when a model performs very well on training data but fails to generalize to unseen data.

We also understood what regularization is and how it helps prevent overfitting while training a model.

We learned two different types of linear regression techniques that use regularization. We also implemented them in Python.


Frequently Asked Questions (FAQs) On Handling Overfitting With Regularization

1. What is Overfitting in Machine Learning?

Overfitting occurs when a machine learning model learns both the underlying pattern and the noise in the training data to such an extent that it negatively impacts the model's performance on new, unseen data.

2. How Does Regularization Help in Handling Overfitting?

Regularization techniques modify the learning algorithm to reduce model complexity, which helps in preventing the model from fitting too closely to the training data, thus reducing overfitting.

3. What are Common Regularization Techniques?

The most common regularization techniques include L1 (Lasso), L2 (Ridge), and Elastic Net regularization.

4. What is L1 Regularization (Lasso)?

L1 regularization (Least Absolute Shrinkage and Selection Operator) adds a penalty equivalent to the absolute value of the magnitude of coefficients. It can lead to some coefficients being exactly zero, effectively performing feature selection.

5. How Does L2 Regularization (Ridge) Work?

L2 regularization (Ridge) adds a penalty equal to the square of the magnitude of coefficients. Unlike L1, it doesn’t set coefficients to zero but shrinks them, thus reducing model complexity.

6. What is Elastic Net Regularization?

Elastic Net is a combination of both L1 and L2 regularization techniques. It is particularly useful when multiple correlated features are present in the dataset.

7. Can Regularization Be Used for All Types of Models?

Regularization is mostly used in linear models like linear regression and logistic regression. However, the concept can be extended to other algorithms, including neural networks.

8. How Do You Choose Between L1 and L2 Regularization?

The choice depends on the data and the problem. L1 can be used when feature selection is important, whereas L2 is generally preferred when all features are expected to be important or when you have more features than training instances.

9. What is the Role of the Regularization Parameter?

The regularization parameter (often denoted as lambda or alpha) controls the strength of the penalty. The optimal value is usually found through cross-validation.

10. Does Regularization Affect Model Bias and Variance?

 Yes, regularization increases bias but reduces variance, aiming for a balanced trade-off for optimal model performance on new data.

11. How Does Regularization Work in Neural Networks?

 In neural networks, regularization can be applied through techniques like dropout, early stopping, or adding penalties to the cost function, similar to L1 or L2.

12. Can Regularization be Used in Non-Linear Models?

 Yes, while commonly used in linear models, regularization concepts can be applied to non-linear models as well, like decision trees and neural networks.

13. Is Regularization Always Beneficial?

 Regularization can significantly improve a model's performance on unseen data, but it can also lead to underfitting if the regularization parameter is too high, which necessitates careful tuning.

14. How to Implement Regularization in Practice?

 In practice, regularization is implemented by adding the regularization term to the loss function. Most machine learning libraries provide options to include regularization directly in model constructors.


I hope you liked this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
