Mastering Ridge Regression: Comprehensive Guide and Practical Applications

How ridge regression works

Today we are going to learn one of the most frequently asked interview questions: what is ridge regression?

Often, people in the field of analytics or data science limit themselves to a basic understanding of regression algorithms such as linear regression. Very few are aware of regularization techniques such as ridge regression and lasso regression.

This gives us a little bit of motivation to explain how ridge regression works in machine learning.


In summary, after completing this article, you will gain a better insight into the following concepts.

  • How ridge regression works
  • How to use ridge regression effectively
  • Where ridge regression comes into play
  • How to implement the ridge regression model in Python

Before we dive into the details of how ridge regression works, let’s see the flow of concepts you are going to learn in this article.

Ridge Regression in Simple Words

The best and simplest answer could be:

"Ridge regression is the regularized form of linear regression."

If you are not convinced by this answer, don’t worry at all. By the end of this article, you will understand its true significance.

Let's start our discussion with the basic building block: linear regression.

Linear Regression in Machine Learning

Linear Regression

Before we learn about ridge regression, we should understand how linear regression works.

Don’t forget, these regression algorithms fall under the supervised learning category.

Any modeling task that involves predicting a numerical value from a set of input features is termed regression. In other words, regression tries to estimate the expected target value when we provide the known input features.

Linear regression is considered the standard algorithm for identifying the linear relationship between the target variable and the input features.

In the above image, the green dots are the actual values, and the red line is the regression line fitted to the actual data. The fitted line follows the familiar line equation.

Y = mX + c

In mathematical terms, 

  • Y is the dependent value, 
  • X is the independent value,
  • m is the slope of the line,
  • c is the constant value.

The same equation terms are named slightly differently in the machine learning and statistical worlds.

  • Y is the predicted value, 
  • X is the feature value,
  • m is  coefficients or weights,
  • c is the bias value.

To create the red line from the actual values, the regression model iterates and recalculates the m (coefficient) and c (bias) values while trying to reduce the loss measured by a proper loss function.
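As a quick illustration of these terms (a minimal sketch with made-up numbers, not from the original article), fitting a one-feature linear regression with scikit-learn recovers m as coef_ and c as intercept_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y roughly follows y = 3x + 5 with a little noise
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([8.1, 10.9, 14.2, 16.8, 20.1])

model = LinearRegression()
model.fit(X, y)

print("m (coefficient):", model.coef_[0])      # close to 3
print("c (bias / intercept):", model.intercept_)  # close to 5
```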

Extensions to linear regression encourage models that use small coefficient values by adding penalties to the loss function during training.

These extensions are termed penalized linear regression or regularized linear regression.

So, ridge regression is a well-known regularized linear regression that makes use of the L2 penalty. This penalty shrinks the coefficients of input variables that contribute little to the prediction task.

With this understanding,  let’s learn about ridge regression.

What is Ridge Regression in Machine Learning

Ridge Regression

In linear regression, a linear relationship exists between the input features and the target variable. The association is a line in the case of a single input variable. 

Still, in higher dimensions, the relationship can be assumed to be a hyperplane connecting the input features to the target variable. The coefficients are found by an optimization method that minimizes the error between the predicted output (ŷ) and the expected output (y).

Linear regression may encounter problems in which the model's estimated coefficients become relatively large, making the model unstable and sensitive to its inputs. This typically happens when there are only a few observations or many input variables.

An approach to regaining the regression model's stability is to modify the loss function so that it includes an additional cost for models with relatively large coefficients.

Linear regression models that use this revised loss function are referred to as "Penalized or Regularized Linear Regression."

Ordinary Least Squares (OLS) and Ridge Regression

Ordinary least squares (OLS) regression is the analysis method that estimates the relationship between independent variables (features) and a dependent variable (target).

It estimates this relationship by minimizing the sum of squared differences between the observed and predicted values of the dependent variable.

On the other hand, a linear regression model whose coefficients are estimated not by OLS but by the ridge estimator, which is biased but has lower variance than the OLS estimator, is termed ridge regression.

In regression modeling, the presence of multicollinearity leads to unstable parameter estimates. Ordinary Least Squares (OLS), the standard method in regression analysis, can result in an inaccurate and unstable model because it is not robust to the multicollinearity problem.

Several methods have been proposed in the literature to address this model instability issue, and the most common one is ridge regression.

Finding Unbiased Coefficients with OLS

We know that the coefficients that best fit the data are found by the least squares method. This method also finds unbiased coefficients.

In this scenario, the word unbiased means that OLS does not discriminate between independent variables. 

It does not consider which independent variable is more significant than others. It merely finds the coefficients for a given data set.

In short, only one set of betas is found, resulting in the lowest Residual Sum of Squares (RSS).
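To make this concrete, here is a small sketch (with made-up numbers, not from the original article) of the OLS closed-form solution, which produces the single set of betas that minimizes the RSS:

```python
import numpy as np

# Toy data: first column of ones models the intercept, plus two features
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 3.0]])
y = np.array([6.0, 7.5, 12.0, 15.5])

# OLS closed form: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

residuals = y - X @ beta
rss = np.sum(residuals ** 2)
print("betas:", beta)
print("RSS:", rss)
```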

Here the question arises:

Is the best model the one which has the lowest RSS?

To answer, we need to know how the built regression models perform in both the training and testing phases, which is where the concept of bias and variance comes in.

Bias & Variance Tradeoff

Bias-variance tradeoff

So still, the question is 

Is the best model the one that has the lowest RSS?

The answer to the question above is, "Not really."

Within the word 'unbiased,' we need to consider 'bias' too. Bias here means how much care a model gives to its predictors. Say there are two models to predict a mango's price from two predictors, 'sweetness' and 'shine'; one of the models is unbiased while the other is biased.

First, the unbiased model finds the relationship between the two features and the prices, just as the OLS method does. This model will fit the observations in such a way as to minimize the RSS entirely. 

However, the consequences may not be favorable, and overfitting issues may arise. In other words, the model will not perform well on a new data set.

Because it is built specifically for the given data set, it may not fit a new set of data.

Underfitting and Overfitting


We can assume that the bias is related to the given model's failure to fit the training set. On the contrary, the variance is associated with the model's inability to fit the testing set.

Both bias and variance are in a trade-off relationship with one another over the model's complexity, which implies that a straightforward model would have low variance and a high bias and the other way around.

L2 Regularization

Overfitting problems may lead to inaccurate and unstable model building. So, a technique that helps minimize the overfitting problem in machine learning models is known as regularization. 

We call it regularization because it keeps the parameters regular, or regularized. Different regression models use different regularization techniques. The regression model using the L1 regularization technique is termed lasso regression, while the regression model using L2 is termed ridge regression.

In this article our focus is on ridge regression, so let's discuss L2 regularization in detail. In the lasso regression article, we will explain L1 regularization techniques.

Regression Line Error

In the above figure, the error function is computed based on the training data set. When our model fits too closely to the training data, it is called model overfitting.

In this scenario, the model's performance is excellent on the training dataset but highly inadequate on the testing data set. 

Regularization comes into play here: it keeps the parameters regular while optimizing the error.

Elements in L2 Regularization

In the figure below, the L2 regularization element is the highlighted part. Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function.

Ridge regression formula:

Loss = Σ(yᵢ − ŷᵢ)² + λ Σβⱼ²

In the formula above, if lambda is zero, then we get OLS. 

However, a high value of lambda will add too much penalty weight, which will result in model under-fitting.
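A small sketch of this effect (on synthetic data, not from the original article), using scikit-learn's Ridge class where lambda is exposed as the alpha parameter: an alpha near zero behaves like OLS, while larger alphas shrink the coefficients towards zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data so the example is self-contained
X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=42)

# An alpha close to zero behaves like OLS; larger alphas shrink the coefficients
for alpha in [0.01, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:<6} coefficients={np.round(model.coef_, 2)}")
```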

Therefore, how we choose the lambda parameter for our model is important. We are not able to cover lasso in detail in this article, so we will give a high-level comparison between lasso and ridge regression.

Difference Between Lasso and Ridge regression

The main difference between lasso and ridge regression is the penalty term. The other differences are listed below, followed by a short code sketch that illustrates the contrast.

Ridge Regression

  • It makes use of the L2 regularization technique.
  • It performs feature weight updates as the loss function has an additional squared term.
  • It drives down the overall size of the weight values during optimization and reduces overfitting.

Lasso Regression

  • It makes use of the L1 regularization technique.
  • It performs the feature weight updates as the loss function has an additional term containing the L1 norm of the weights vector.
  • It causes the weights of some features to decline to zero at some point, thus effectively eliminating those features which cause high variance and model over-fitting issues.
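Here is that sketch: a small, hedged example (using synthetic data with only a few informative features, not from the original article) showing that ridge shrinks all weights while lasso drives some of them to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Data where only a few of the 10 features are actually informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks all weights; lasso tends to zero out the uninformative ones
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Weights zeroed out by lasso:", np.sum(lasso.coef_ == 0))
```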

When to Use Ridge Regression?

We know that the Ordinary Least Square Method (OLS) treats all the variables in an unbiased manner. So, as more variables are incorporated, the OLS model becomes more complicated. 

In the figure below, the OLS model sits on the right side, with low bias and high variance. The OLS model's position is fixed, but its position can change when ridge regression comes into play.

In ridge regression, the model coefficients will change as we tune the lambda parameter.

Bias

Geometric Understanding of Ridge Regression

The figure below represents the geometric interpretation to compare OLS and ridge regression.

Each contour connects points where the RSS is the same, and the contours are centered at the OLS estimate, where the RSS is lowest. The OLS estimate is also the point that best fits the training set (low bias).

Ordinary Least Squares

Vectorized Version

The vector norm is nothing but the following definition.

‖β‖₂ = √(β₁² + β₂² + … + βₚ²)

The subscript '2' is as in 'L2 norm'. We only care about the L2 norm at this moment, so we can construct the equation we've already seen.
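As a tiny sanity check (a sketch, not part of the original article), the L2 norm definition can be verified directly in Python:

```python
import numpy as np

beta = np.array([3.0, 4.0])

# L2 norm by the definition: square root of the sum of squared entries
manual_l2 = np.sqrt(np.sum(beta ** 2))
print(manual_l2)                    # 5.0
print(np.linalg.norm(beta, ord=2))  # same value via numpy
```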

Mathematical Notation

In the following equation, the first term is OLS and the second term with the lambda parameter makes ridge regression.

minimize over β:  ‖y − Xβ‖₂² + λ‖β‖₂²

What We Really Want to Find?

The lambda term is often called the "penalty," as it causes the RSS to increase. We iterate over several values of lambda and then evaluate the model with a measurement such as Mean Squared Error (MSE). The value of lambda that minimizes MSE is selected as the final one. A well-tuned ridge regression model often predicts better than the OLS model.

In the formula below, if lambda is equal to zero, i.e., no penalty, ridge regression becomes the same as OLS.

minimize over β:  ‖y − Xβ‖₂² + λ‖β‖₂²   (with λ = 0 this reduces to OLS)
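A minimal sketch of this selection process (assuming synthetic data and a hand-picked grid of lambda values, exposed as alpha in scikit-learn): fit the model for each candidate value and keep the one with the lowest test MSE.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Try a range of lambda (alpha) values and keep the one with the lowest test MSE
best_alpha, best_mse = None, float("inf")
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print("best alpha:", best_alpha, "test MSE:", round(best_mse, 2))
```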

Ridge Regression Assumptions

Linear regression assumptions and ridge regression assumptions are more or less the same: linearity, constant variance, and independence of the features from one another. However, we do not need to assume that the distribution of the errors is normal.

To gain practical experience with ridge regression, let's learn the step-by-step process of building a ridge regression model with sklearn.

Ridge Regression Sklearn Python Implementation

So, let's start with a basic ridge regression implementation in Python with the sklearn package. First of all, we have to import the following libraries.
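The exact import list from the original code is not shown here, but a reasonable set for this walkthrough would be:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
```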

To create the sample data, we are using the scikit-learn library.
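For instance (a sketch, assuming a single-feature synthetic data set so the regression line can be plotted):

```python
# One feature keeps the example easy to visualize as a 2D scatter plot
X, y = make_regression(n_samples=100, n_features=1, noise=20.0, random_state=42)
```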

Now, we will define alpha, i.e., the hyperparameter, which determines the strength of regularization. The larger the value of the hyperparameter, the greater the strength of regularization.

In short, when alpha is large, the model has a very high bias. With an alpha value of 0, the model acts identically to linear regression.

Now, let’s have a look at how the regression line will fit the data.

Figure 1

Now, we do the same thing using the scikit-learn implementation of ridge regression. For this, we first need to create and train an instance of the Ridge class.
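A minimal sketch of that step (assuming the synthetic X, y created above and alpha = 0.5, which the next figure refers to):

```python
# Create and train a Ridge instance with alpha = 0.5
ridge = Ridge(alpha=0.5)
ridge.fit(X, y)

# Sort by the single feature so the regression line draws cleanly
order = X[:, 0].argsort()

plt.scatter(X, y, color="green", label="actual values")
plt.plot(X[order], ridge.predict(X)[order], color="red", label="ridge line (alpha=0.5)")
plt.legend()
plt.show()
```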

Figure 2

As we can see, the regression line fit much better when we opted for 0.5 as the value of alpha. Now, we try putting 10 as the value of our hyperparameter.
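The same sketch, refitted with the stronger penalty alpha = 10 (reusing X, y, and order from above):

```python
# Refit with a stronger penalty and redraw the line
ridge_10 = Ridge(alpha=10).fit(X, y)

plt.scatter(X, y, color="green", label="actual values")
plt.plot(X[order], ridge_10.predict(X)[order], color="red", label="ridge line (alpha=10)")
plt.legend()
plt.show()
```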

Figure 3

Now, we set alpha = 100 and observe the results:
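Again as a sketch reusing the same synthetic data:

```python
# A very strong penalty pulls the slope further towards zero
ridge_100 = Ridge(alpha=100).fit(X, y)

plt.scatter(X, y, color="green", label="actual values")
plt.plot(X[order], ridge_100.predict(X)[order], color="red", label="ridge line (alpha=100)")
plt.legend()
plt.show()
```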

Figure 5

When the value of alpha tends towards positive infinity, the slope of the regression line shrinks towards zero and the line flattens out, effectively minimizing the variance across different datasets.

For the complete code, please check out our Github account.

Conclusion

So, we studied ridge regression and compared it with lasso regression and the least squares method. We dove deeply into ridge regression by viewing it from different angles: the mathematical formula, vectorized notation, and geometric explanation.

We learned that ridge regression is linear regression with a penalty, and that no single equation can find the best value of lambda.

We iterated over several different values of lambda and evaluated the prediction performance with MSE.

Consequently, we found that the ridge regression model can perform better than the simple regression model. At the end, we implemented ridge regression in Python.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
