# How to Handle Overfitting With Regularization


Overfitting and regularization are two of the most common terms you will hear in **Machine learning** and Statistics. A model is said to be overfitting if it performs very well on the training data but fails to perform well on unseen data.

This is one of the most common and dangerous problems you can run into when **training your machine learning models**. There are many techniques that you can use to fix it, and **Regularization** is one of them.

This post will help you understand what regularization is and how it helps in fixing the overfitting problem.

Are you excited to learn?

Great. Before we start, here is what you are going to learn in this article: what overfitting is, how to identify it, techniques to fix it, what regularization is, how Ridge and Lasso regression use it, and how to implement it in Python.


## Understanding Overfitting in Machine learning

Overfitting occurs when the model is trying to learn the data too well. In other words, the model attempts to **memorize** the training dataset. This leads to capturing noise in the training data.

A model that is flexible enough to learn data points that are present only by random chance, and that don't represent true properties of the data, ends up fitting that noise.

Hence, the **model performs too well** on the training data but fails to perform well on unseen data.

Overfitting is like training your **dog to lie down** when you whistle, and it learns the trick perfectly after some practice. But it lies down only at your whistle and not at anyone else's, because it was trained on your whistle alone.

The same thing happens to your model when it is **not trained correctly** or if there is not enough variation in the training data.

The model only works perfectly for the training dataset but fails to generalize for unseen data.

## How to identify Overfitting?

Some of the **signals** to look out for to identify whether the model is overfitting are:

- Train and test loss comparison
- Prediction graph
- Cross-validation

Let’s understand these in detail.

### Train and Test loss

Loss is a number indicating **how bad** the model's prediction was. In general, it is helpful to split the dataset into three parts:

- Training
- Validation
- Testing

Validation is for **checking the model’s performance** before testing it on a completely new dataset. It is important to calculate **losses** during training and validation.

If the training loss is **very low**, but the validation loss is very high, then your model is overfitting.

Another way of saying the same thing is in terms of **bias and variance**: an overfit model has low bias but high variance.
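The train-versus-validation comparison can be sketched in a few lines. This is a minimal illustration, not the article's original experiment: the noisy sine data, the 50/50 split, and the polynomial degrees 3 and 15 are all assumptions chosen to make the gap visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, size=(40, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit a modest model and a very flexible one, then compare both losses.
losses = {}
for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_loss = mean_squared_error(y_train, model.predict(X_train))
    val_loss = mean_squared_error(y_val, model.predict(X_val))
    losses[degree] = (train_loss, val_loss)
    print(f"degree={degree:2d}  train MSE={train_loss:.4f}  validation MSE={val_loss:.4f}")
```

A training loss that keeps falling while the validation loss rises as the model gets more flexible is the pattern to watch for.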

### Prediction Graph

If the data points and the fitted curve are plotted in a graph, and if the curve looks too complex, fitting all the data points perfectly, then your model is overfitting.

## Techniques to fix Overfitting

There are many techniques that you can follow to prevent the overfitting of your model.

### Train with more data

This is not always possible, but if the model is too complicated and the data available for training is comparatively small, then it is better to increase the size of the training data.

### Don’t train with highly complex models

If you are training a very complex model for relatively less complex data, then the chances for overfitting are very high. Hence, it’s better to **reduce** the model complexity to prevent overfitting.

### Cross-validation

**Cross-validation** is a powerful method to prevent overfitting. The idea of cross-validation is to divide our training dataset into multiple **mini train-test** splits. Each split is called a **fold**.

We divide the train set into **k folds**. The model is iteratively trained on k-1 folds, and the remaining fold is used as a test fold. Based on the model performance on the test fold, we **tune its hyperparameters**.

This allows us to tune the hyperparameters with only our original training dataset without any involvement of the original test dataset.
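The fold mechanics described above can be written out as an explicit loop. This is a sketch under stated assumptions: the data is synthetic, Ridge (introduced later in this article) stands in as an example model with one hyperparameter, and the candidate `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data; the original test set is held out and never tuned on.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # arbitrary illustrative grid

mean_val_mse = {}
for alpha in candidate_alphas:
    fold_mse = []
    # Train on k-1 folds, evaluate on the remaining fold, and repeat k times.
    for fit_idx, val_idx in kfold.split(X_train):
        model = Ridge(alpha=alpha).fit(X_train[fit_idx], y_train[fit_idx])
        predictions = model.predict(X_train[val_idx])
        fold_mse.append(mean_squared_error(y_train[val_idx], predictions))
    mean_val_mse[alpha] = float(np.mean(fold_mse))

# Pick the hyperparameter with the lowest average validation error.
best_alpha = min(mean_val_mse, key=mean_val_mse.get)
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("best alpha:", best_alpha)
print("held-out test R^2:", final_model.score(X_test, y_test))
```

The test set is touched only once, after the hyperparameter has been chosen.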

You can learn more about cross-validation in our article on the topic.

### Remove unnecessary features

If there are **too many features**, then the model tends to overfit. So, it is better to **remove unnecessary features** from the dataset before training.

### Regularization

As the word suggests, it is the process of regularizing the model parameters, which discourages learning a more complex model.

In the next section, we will try to understand this process clearly.

## What is Regularization?

Before we discuss regularization, let's think about the questions below.

- What if there is less data for training?
- How do we avoid overfitting?

Let’s assume the data has too many features. How do we know which features to remove and which features to keep?

In any machine learning problem, the **accuracy of a model** is a very important factor. In **binary classification models** we check the performance of the model with a **confusion matrix**.

For other classification models we have various **evaluation metrics**.

Adding more features often improves a model's accuracy on the training data, but adding too many results in overfitting.

Then how do we avoid overfitting?

This is where regularization comes in handy.

Regularization, as the name suggests, is the process of regularizing something. Regularization shrinks the parameters of the model toward zero, which **reduces** its freedom.

Hence, the model will be less likely to fit the noise of training data and will improve the generalization ability of the model.

We penalize the **cost function** by adding a penalty that regularizes or **shrinks** the coefficient estimates toward zero.

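Let’s look at the cost function. For linear regression, a common choice is the squared-error cost:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where

- **hθ(x(i))** is the predicted value for some data point x(i)
- **y(i)** is the original value for that data point
- **m** is the number of training examples

The penalized cost function adds a term that grows with the size of the parameters. With a squared penalty, it looks like this:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

where

**λ** is the tuning parameter that decides how much we want to penalize the flexibility of our model. It can be tuned using cross-validation.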

So each time some parameter tries to become large, it will be penalized to a small value.
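To make the effect of the penalty concrete, here is a small NumPy sketch. It assumes a squared-error cost of the form (1/2m) Σ (prediction − target)² and a squared penalty λ Σ θ²; the function names, the toy data, and the value of `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """Unpenalized cost: (1 / 2m) * sum of squared residuals."""
    m = len(y)
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)

def penalized_cost(theta, X, y, lam):
    """Squared-error cost plus the penalty lam * sum(theta_j ** 2)."""
    return squared_error_cost(theta, X, y) + lam * float(theta @ theta)

# Toy data (an assumption): y is reproduced exactly by theta = [1, 2].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
lam = 0.1

theta_small = np.array([1.0, 2.0])
theta_large = np.array([10.0, 20.0])

for name, theta in (("small", theta_small), ("large", theta_large)):
    print(
        name,
        "cost:", squared_error_cost(theta, X, y),
        "penalized cost:", penalized_cost(theta, X, y, lam),
    )
```

Even for the parameters that fit the toy data perfectly, the penalized cost is above zero, and the penalty term grows quadratically as the parameters grow.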

There are two kinds of regularization:

- L1 Regularization
- L2 Regularization

### L1 Regularization

This adds a penalty equal to the **L1 norm** of the weights vector (the sum of the absolute values of the coefficients). It will shrink some parameters to **zero**.

Hence some variables will not play any role in the model. L1 regularization can be seen as a way to select features in a model.

$$L_1 = L(X, y) + \lambda \sum_{j} |\theta_j|$$

### L2 Regularization

This adds a penalty equal to the squared L2 norm of the weights vector (the sum of the squared values of the coefficients). It will force the parameters to be relatively small.

$$L_2 = L(X, y) + \lambda \sum_{j} \theta_j^2$$
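The different behavior of the two penalties is easy to see on a small example. The sketch below assumes synthetic data in which only 3 of 10 features carry signal, and uses an arbitrary `alpha=5.0` for both scikit-learn models. The two estimators scale their penalties differently, so the shared value is for illustration only, not a like-for-like comparison.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption): 10 features, only 3 of them informative.
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=5.0).fit(X, y)  # L2 penalty

# Count coefficients that were driven exactly to zero by each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros -> Lasso:", n_zero_lasso, " Ridge:", n_zero_ridge)
```

The L1 penalty removes features outright, while the L2 penalty keeps every feature and only makes the coefficients smaller.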

## Ridge and Lasso Regression

Two of the very powerful techniques that use the concept of L1 and L2 regularization are Lasso regression and Ridge regression.

These models are extremely helpful in the presence of a large number of features in the dataset.

### Lasso Regression

- Lasso regression is like linear regression, but it uses L1 regularization to shrink the coefficients.
- So, the minimized cost function is the original cost function plus a penalty equal to the sum of the absolute values of the coefficients.

### Ridge Regression

When the regression model uses L2 regularization, it is called Ridge regression.

The minimized cost function is the original cost function with some penalty equivalent to the sum of the squares of the magnitude of the coefficients.
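For Ridge regression, this penalized cost has a closed-form minimizer, which makes the shrinkage easy to verify. The sketch below assumes no intercept term and synthetic data; `lam`, `true_theta`, and the variable names are hypothetical. It solves θ = (XᵀX + λI)⁻¹Xᵀy directly and compares the result with scikit-learn's `Ridge`, which minimizes the sum of squared errors plus `alpha` times the sum of squared coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption), generated without an intercept.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
true_theta = np.array([3.0, -2.0, 0.0, 1.0])
y = X @ true_theta + rng.normal(scale=0.5, size=50)

lam = 2.0  # arbitrary penalty strength for illustration

# Closed-form minimizer of ||y - X theta||^2 + lam * ||theta||^2.
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The same problem solved by scikit-learn (intercept disabled to match).
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Ordinary least squares for comparison (lam = 0).
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("closed form:", np.round(theta_closed, 4))
print("sklearn    :", np.round(theta_sklearn, 4))
print("OLS        :", np.round(theta_ols, 4))
```

The Ridge coefficient vector has a smaller norm than the ordinary least squares one, which is the shrinkage the penalty is designed to produce.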

Look at our articles where we clearly explained Lasso and Ridge regressions:

- Lasso Regression: **https://dataaspirant.com/lasso-regression/**
- Ridge Regression: **https://dataaspirant.com/ridge-regression/**

## Regularization implementation in Python

Now let’s implement Regularization in Python. We are going to use the House Sales dataset. First, let’s import some necessary libraries and clean the dataset.

Now, we’ll check how well different regression models are working.

### Linear Regression Implementation
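A baseline fit might look like the sketch below. The House Sales data is not reproduced here, so `make_regression` is used as a synthetic stand-in. The sample size, feature count, and noise level are assumptions, and the resulting numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned House Sales features and target.
X, y = make_regression(
    n_samples=300, n_features=15, n_informative=8, noise=20.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Plain linear regression with no regularization, used as the baseline.
linear_model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, linear_model.predict(X_train))
test_mse = mean_squared_error(y_test, linear_model.predict(X_test))
test_r2 = linear_model.score(X_test, y_test)

print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
print(f"test R^2:  {test_r2:.3f}")
```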

We’ll be using the scikit-learn library for implementing our algorithms. To understand the library better, you can check out the official scikit-learn documentation.

### Lasso Regression (L1) Implementation

We’ll use cross-validation to tune the hyperparameter `alpha`.
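One way to do this tuning is with scikit-learn's `GridSearchCV`. The sketch below again uses synthetic stand-in data rather than the House Sales dataset, and the `alpha` grid is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned House Sales features and target.
X, y = make_regression(
    n_samples=300, n_features=15, n_informative=8, noise=20.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate penalty strengths (illustrative values only).
param_grid = {"alpha": [0.01, 0.1, 0.25, 0.5, 1.0, 5.0]}

# 5-fold cross-validation over the grid, scored by negative MSE.
search = GridSearchCV(
    Lasso(max_iter=10000), param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X_train, y_train)

best_lasso = search.best_estimator_
print("best alpha:", search.best_params_["alpha"])
print("test R^2:", best_lasso.score(X_test, y_test))
```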

### Ridge Regression (L2) Implementation
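For Ridge, scikit-learn also provides `RidgeCV`, which runs the cross-validated search over `alpha` internally. The sketch below is an illustration on synthetic stand-in data with an assumed grid of candidate values, so the `alpha` it selects will generally differ from the one found on the House Sales dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned House Sales features and target.
X, y = make_regression(
    n_samples=300, n_features=15, n_informative=8, noise=20.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate penalty strengths (illustrative values only).
alphas = [0.01, 0.1, 0.25, 0.5, 1.0, 5.0]

# RidgeCV evaluates every alpha with 5-fold cross-validation,
# then refits on the full training set using the best one.
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)

print("selected alpha:", ridge_cv.alpha_)
print("test R^2:", ridge_cv.score(X_test, y_test))
```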

On the House Sales dataset, Ridge regression with an alpha value of 0.25 fit the data best. So, we use the ridge regressor with the selected value as the final model.

In this way, we could implement regularization with linear regression models.

## Conclusion

In this blog post, we tried to understand what overfitting is and how to identify it. Overfitting is simply when a model performs very well on training data but fails to generalize to unseen data.

We also understood **what regularization is** and how it prevents overfitting while training a model.

We learned two different types of linear regression techniques that use regularization. We also implemented them in Python.

Check out the following links which might help you to understand regularization.

- Cross Validation - https://dataaspirant.com/cross-validation/
- Lasso Regression - https://dataaspirant.com/lasso-regression/
- Ridge Regression - https://dataaspirant.com/ridge-regression/
- StatQuest - https://www.youtube.com/watch?v=Q81RR3yKn30&t=1055s
- Krish Naik - https://www.youtube.com/watch?v=9lRv01HDU0s
- Andrew Ng - https://www.youtube.com/watch?v=KvtGD37Rm5I
