How Lasso Regression Works in Machine Learning

Whenever we talk about regression modeling algorithms, we hear about linear regression, lasso regression and ridge regression.

Some people think logistic regression also falls under the regression algorithms family, but logistic regression falls under the classification algorithms category.

Regression topics are quite famous and are among the basic introductory topics in machine learning, which everyone needs to learn.

Below are the other types of regression:

  • Lasso regression
  • Ridge regression
  • Polynomial regression
  • Stepwise regression
  • ElasticNet regression

The above-mentioned techniques are mainly used in regression-type analytical problems.

Do you know why regression models tend to overfit?

When we increase the degrees of freedom (adding higher-order polynomial terms to the equation) of a regression model, it tends to overfit.
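To see this in action, here is a minimal sketch on synthetic data (all values are made up) where raising the polynomial degree drives the training error down while the test error climbs:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))
# The high-degree model fits the training points almost perfectly,
# but its test error blows up: that gap is overfitting.
```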

Then how can we overcome the overfitting issue? Do we have any methods or techniques for that?

Yes, we do.

That's what we are going to learn in this lasso regression article. 

Using regularization techniques, we can overcome the overfitting issue. We have two popular methods: lasso regression and ridge regression.

In our ridge regression article, we explained the theory behind ridge regression along with its implementation in Python.

In this article, we are going to focus on lasso regression.

Before we dive further, let's recap regression.

What Is Regression?

Regression is a statistical technique used to determine the relationship between one dependent variable and one or more independent variables.

In simple words, a regression analysis tells you how your result varies with different factors.

For example,

What determines a person's salary?

Many factors, like

  • Educational qualification
  • Experience
  • Skillset
  • Current job role
  • Current company, etc.

The above features play a key role in determining the salary.

You can use regression analysis to predict the dependent variable (salary) using the factors mentioned above.

y = mx + c

Do you remember this equation from our school days?

It is nothing but the linear regression equation. In the above equation, the dependent variable is estimated from the independent variable.

In mathematical terms, 

  • Y is the dependent value, 
  • X is the independent value,
  • m is the slope of the line,
  • c is the constant value.

The same equation terms are named slightly differently in the machine learning and statistical world.

  • Y is the predicted value,
  • X is the feature value,
  • m represents the coefficients or weights,
  • c is the bias value.
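As a quick numeric sketch of the salary example (the slope and bias values below are made up purely for illustration):

```python
m = 5000   # assumed weight: salary increase per year of experience
c = 30000  # assumed bias: base salary
x = 4      # feature value: 4 years of experience

y = m * x + c  # predicted salary
print(y)       # 50000
```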
Linear Regression

The line in the above graph represents the linear regression model. You can see how well the model fits the data.

It looks like a good model, but sometimes the model fits the data points too closely, meaning the line passes through most of the data points in the above graph.

This results in overfitting.

To create the line (red) from the actual values, the regression model will iterate and recalculate the m (coefficient) and c (bias) values while trying to reduce the loss value with a proper loss function.

Due to overfitting, the model will have low bias and high variance. This means our model gives high accuracy on the training dataset, whereas on the test dataset it gives low accuracy.

This scenario is what we call an overfitted model. How can we tackle this issue?

Regularization comes into play to tackle this issue.

What Is Regularization?


Regularization solves the problem of overfitting, which causes low model accuracy. Overfitting happens when the model learns the data as well as the noise in the training set.

Noise means the random data points in the training set which don't represent the actual properties of the data.

Y ≈ C0 + C1X1 + C2X2 + …+ CpXp

Y represents the dependent variable, X represents the independent variables and C represents the coefficient estimates for different variables in the above linear regression equation. 

The model fitting involves a loss function known as the sum of squares. The coefficients in the equation are chosen in a way that minimizes the loss function. If there is a lot of irrelevant data (noise) in the training set, the model learns wrong coefficients.

Such coefficients will not serve the model's predictions well in the future.

In cases like this, we can use regularization to regularize or shrink these wrongly learned coefficients towards zero. Lasso regression is one of the popular regularization techniques used to improve model performance.
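A minimal sketch of that idea on synthetic data, comparing plain linear regression with lasso (the feature names and alpha value are our choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.RandomState(42)
relevant = rng.normal(size=200)       # feature that truly drives Y
noise_feature = rng.normal(size=200)  # irrelevant noise column
X = np.column_stack([relevant, noise_feature])
y = 3.0 * relevant + rng.normal(scale=0.5, size=200)  # Y ignores the noise column

print(LinearRegression().fit(X, y).coef_)  # picks up a small spurious weight on noise
print(Lasso(alpha=0.1).fit(X, y).coef_)    # shrinks the noise weight to (near) zero
```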

Definition Of Lasso Regression in Machine Learning

Lasso regression is like linear regression, but it uses a technique called "shrinkage", where the regression coefficients are shrunk towards zero.

Linear regression gives you the regression coefficients as observed in the dataset. Lasso regression allows you to shrink or regularize these coefficients to avoid overfitting and make the model work better on different datasets.

This type of regression is used when the dataset shows high multicollinearity or when you want to automate variable elimination and feature selection.

When To Use Lasso Regression?

Choosing a model depends on the dataset and the problem statement you are dealing with. It is essential to understand the dataset and how features interact with each other. 

Lasso regression penalizes less important features of your dataset and makes their respective coefficients zero, thereby eliminating them. Thus it provides you with the benefit of feature selection and simple model creation. 

So, if the dataset has high dimensionality and high correlation, lasso regression can be used.

The Statistics Of Lasso Regression

Statistics of lasso regression

d1, d2, d3, etc., represent the distances between the actual data points and the model line in the above graph.

Least-squares is the sum of the squared distances between the data points and the plotted line.

In linear regression, the best model is chosen in a way that minimizes the least-squares.

While performing lasso regression, we add a penalizing factor to the least-squares. That is, the model is chosen in a way that minimizes the below loss function:

D = least-squares + lambda * sum(absolute values of the coefficients)

The lasso penalty consists of all the estimated parameters. Lambda can take any value from zero to infinity; this value decides how aggressively the regularization is performed. It is usually chosen using cross-validation.

Lasso penalizes the sum of the absolute values of the coefficients. As the lambda value increases, the coefficients shrink and eventually become zero. This way, lasso regression eliminates insignificant variables from our model.
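Written as a small helper function, the loss from the formula above looks roughly like this (a sketch; the argument names are ours, not from any particular library):

```python
import numpy as np

def lasso_loss(y_true, y_pred, coefficients, lam):
    """D = least-squares + lambda * sum of absolute coefficient values."""
    least_squares = np.sum((y_true - y_pred) ** 2)
    penalty = lam * np.sum(np.abs(coefficients))
    return least_squares + penalty
```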

Our regularized model may have a slightly higher bias than linear regression, but less variance for future predictions.

Lasso Regression Implementation In Python


Let us take a regression problem statement and solve it using lasso regression to learn the implementation in Python.

Problem Statement

Real estate is a fairly big industry, and housing prices keep varying regularly based on different factors. The problem statement here is to predict housing prices as accurately as possible.

The housing dataset has 506 rows, with 13 numerical inputs and one numerical output.

Data Attribute Information

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town
  4. CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built before 1940
  8. DIS - weighted distances to five Boston employment centers
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - median value of owner-occupied homes in $1000's

Evaluation metrics

We will use two evaluation metrics, RMSE and R-squared, to evaluate our model's performance. Root Mean Squared Error (RMSE) is the standard deviation of the residuals.

Residuals measure the distance between the predicted data points and the actual data points, which shows how good the built regression model is.

In the same way, we have various evaluation metrics to measure how good a built classification model is.

Since RMSE is a direct measure of prediction error, we should aim for a low value. The R-squared value represents how good the model fit is and how close the data are to the regression line.

A high R-squared indicates a good model fit.
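As a quick sketch of how both metrics are computed with scikit-learn (the values below are placeholders):

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score

y_actual = [24.0, 21.6, 34.7, 33.4]     # placeholder actual values
y_predicted = [25.1, 20.9, 33.2, 34.0]  # placeholder predictions

rmse = sqrt(mean_squared_error(y_actual, y_predicted))  # lower is better
r2 = r2_score(y_actual, y_predicted)                    # closer to 1 is better
print(rmse, r2)
```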

If we are building regression models with many features, we need to use the Adjusted R-squared measure along with the R-squared value.

If you want to learn about the R-squared and Adjusted R-squared measures, you can read this article.

Lasso Regression Modeling Workflow 

We will follow these steps to produce a lasso regression model in Python:

  • Step 1: Load the required modules and libraries
  • Step 2: Load and analyze the dataset given in the problem statement
  • Step 3: Create training and test dataset
  • Step 4: Build the model and find predictions for the test dataset
  • Step 5: Evaluate the lasso regression model

Let's start the workflow by loading the required libraries.

Load the required modules and libraries

We will import the pandas and numpy modules to handle the dataset, and the train_test_split module to create the training and test datasets.

The r2_score, sqrt and mean_squared_error modules are imported to calculate the evaluation metrics. The Lasso module from scikit-learn will be used to build our lasso regression model.
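Based on the description above, the imports would look roughly like this (LassoCV is included here as well, since we use it for tuning later):

```python
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, LassoCV
```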

Load and analyze the dataset given in the problem statement

Let us load the dataset and analyze basics like the shape and summary statistics of the dataset.
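A sketch of loading the data, assuming the dataset is available as a local CSV file (the filename housing.csv and the column order are placeholders):

```python
# Load the housing data; 'housing.csv' is a placeholder path
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
housing = pd.read_csv("housing.csv", header=None, names=columns)

print(housing.shape)       # expect (506, 14)
print(housing.describe())  # summary statistics
```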

Create training and test dataset

We are going to split the dataset into a training set and a test set. We will build our lasso model on the training set and evaluate it using the test set.

Specify the input columns as X and the target column as Y, and use the test_size argument of the train_test_split module to split the dataset. Here we are splitting our dataset into 70% training data and 30% test data.
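Continuing from the snippet above, a sketch of the split (the random_state value is our choice, for reproducibility):

```python
X = housing.drop("MEDV", axis=1)  # 13 input features
Y = housing["MEDV"]               # target: median home value

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)  # 70% train / 30% test
```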

Build the model and find predictions for the test dataset

Let us instantiate the lasso model and fit it to the training set. We will use this fitted model to predict the housing prices for both the training set and the test set.
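A sketch of fitting the model, continuing from the snippets above:

```python
lasso = Lasso(alpha=1.0)  # alpha=1 means a full penalty (discussed below)
lasso.fit(X_train, Y_train)

train_pred = lasso.predict(X_train)  # predictions on the training set
test_pred = lasso.predict(X_test)    # predictions on the test set
```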

Evaluate the lasso model

Evaluate the model by finding the RMSE and R-squared for both the training and test predictions.
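Roughly like this, continuing from the snippets above:

```python
print("Train RMSE:", sqrt(mean_squared_error(Y_train, train_pred)))
print("Train R2  :", r2_score(Y_train, train_pred))
print("Test RMSE :", sqrt(mean_squared_error(Y_test, test_pred)))
print("Test R2   :", r2_score(Y_test, test_pred))
```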

As you can see, we have set the lasso hyperparameter alpha to 1, i.e., a full penalty. This alpha value gives us a decent RMSE as of now. But there might be a different alpha value that can provide us with better results.

Let us tune our model to check this. 

The scikit-learn library has a built-in algorithm called LassoCV which will do the tuning for us. This algorithm finds the best alpha value and completes the model tuning simultaneously during training itself. Predictions can then be made using the fitted model.

By default, the model will do the tuning using 100 alpha values. We can control this by specifying the alphas argument with a grid of alpha values. In the below code, the range of alpha values is set between 0 and 1 with an interval of 0.02.
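A sketch of the tuning step, continuing from the snippets above (the cv=5 choice is ours; LassoCV also accepts other cross-validation settings):

```python
lasso_cv = LassoCV(alphas=np.arange(0, 1, 0.02), cv=5)  # grid: 0 to 0.98, step 0.02
lasso_cv.fit(X_train, Y_train)

print("Best alpha:", lasso_cv.alpha_)
cv_test_pred = lasso_cv.predict(X_test)
print("Test RMSE :", sqrt(mean_squared_error(Y_test, cv_test_pred)))
print("Test R2   :", r2_score(Y_test, cv_test_pred))
```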

LassoCV has chosen the best alpha value as 0, meaning zero penalty. You can see that the RMSE and R-squared scores have improved slightly with the selected alpha value.

To get the complete code used in this article, please visit the Dataaspirant GitHub account. To get all our article codes, use this link to fork.

Conclusion

We have learned about the lasso regression model in machine learning in this article. We have also covered a few interesting topics like regression, overfitting, regularization, lasso model evaluation and tuning.

To summarize,

  • Regression is a popular statistical technique used in machine learning to predict a numerical output.
  • Overfitting happens while doing regression due to irrelevant noise in the training dataset.
  • Regularization can be used to avoid overfitting by regularizing the regression models.
  • Lasso regression is a regularization algorithm which can eliminate irrelevant noise, perform feature selection, and hence regularize a model.
  • Evaluation of the lasso model can be done using metrics like RMSE and R-squared.
  • Alpha is a hyperparameter in the lasso model which can be tuned using LassoCV to control the regularization.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
