# Popular Feature Selection Methods in Machine Learning

Feature selection is a key factor in building accurate **machine learning models**. For any given dataset, a machine learning model learns the **mapping between** the input features and the target variable.

So, for new data where the target is unknown, the model can accurately predict the target variable.

In machine learning, many factors affect the **performance of a model**, and they include:

- Algorithm choice,
- The features used to train the model,
- Parameters used in the algorithm,
- Quality of the dataset

Occasionally, the features of a dataset in their raw form do not provide the optimal information for training and prediction.

Therefore, it is beneficial to discard the conflicting and unnecessary features from our dataset through a process known as **feature selection**, carried out with feature selection methods or techniques.

In machine learning, we define a feature as:

An individual measurable property or a characteristic feature of a phenomenon under observation.

Each feature or column represents a measurable piece of data that helps the analysis. Examples of feature variables are:

- Name,
- Age,
- Gender,
- Education qualification,
- Salary etc.

If you look at the above features from a machine learning model’s point of view, **names** won’t add any significant information.

There are various techniques to **convert text data to numerical** form, but in this case the **name feature** is still not helpful.

We can remove such features manually, but non-significant features are not necessarily text data. They could be **numerical features** too.

How do we remove those features before going to the modeling phase?

This is where **feature selection** methods come in; they help identify the key features to build the model.

Now, we define the feature selection process as follows:

“The method of reducing the number of input variables during the development of a predictive model.”

**OR**

“Feature selection is a process of automatic selection of a subset of relevant features or variables from a set of all features, used in the process of model building.”

Other names of feature selection are variable selection or attribute selection.

It is possible to select those characteristic variables or features in our data that are most useful for building accurate models.

So how can we filter out the best features out of all the available features?

To achieve that, we have various feature selection methods.

So, in this article, we will explore the feature selection methods that we can use to identify the best features for our machine learning model.

After reading this article, you will get to know about the following:

- Two main types of feature selection techniques are supervised and unsupervised, and the supervised methods are further classified into the wrapper, filter, and intrinsic methods.
- Filter-based feature selection methods use statistical techniques to score the dependence or correlation between input variables, which are further filtered to choose the most relevant features.
- Statistical measures must be carefully chosen for feature selection on the basis of the data type of the input variable and the response (output) variable.

## Why is Feature Selection Important?

Feature Selection is one of the key concepts in machine learning, which highly impacts the **model’s performance**.

Irrelevant and misleading data features can **negatively impact** the performance of our machine learning model. That is why feature selection and data cleaning should be the first steps of our model design.

These feature selection methods reduce the number of **input variables/features** to those that are considered to be useful in the prediction of the target.

So, the primary focus of feature selection is to:

“Remove non-informative or redundant predictors from our machine learning model.”

Some predictive modeling problems contain a large number of variables that require a large amount of system memory and therefore slow down the development and training of the models.

The importance of feature selection in building a machine learning model is:

- It **improves the accuracy** with which the model predicts the target variable on unseen data.
- It **reduces** the computational cost of the model.
- It improves the **understandability** of the model by removing unnecessary features, so that the model becomes more interpretable.

## Benefits of Feature Selection

Having irrelevant features in your data can **decrease the accuracy** of many models, especially **linear algorithms** like linear and **logistic regression**.

The benefits of performing feature selection before building the model are as follows:

- **Reduction in Model Overfitting:** Less redundant data implies less opportunity to make noise-based decisions.
- **Improvement in Accuracy:** Less misleading data implies improvement in modeling accuracy.
- **Reduction in Training Time:** Less data implies that algorithms train at a faster rate.

### Difference Between Supervised and Unsupervised methods

We can think of the feature selection methods in terms of **supervised and unsupervised** methods.

The methods that attempt to discover the relationship between the input variables, also called independent variables, and the target variable are termed supervised methods.

They aim to identify the relevant features for achieving a highly accurate model, and they rely on the availability of labeled data.

Examples of supervised learning algorithms are:

The methods that do not require any **labeled** data are termed **unsupervised methods**.

They find interesting activity patterns in **unlabelled data** and score all data dimensions based on various criteria such as variance, entropy, and ability to preserve local similarity, etc.

For example,

**Clustering** is used for customer segmentation, identifying the different customer groups around which marketing and business strategies are built.

Unsupervised feature selection methods **don’t** consider the target variable. An example is the methods that remove redundant variables using correlation.

On the contrary, supervised feature selection techniques make use of the target variable. An example is the methods that remove irrelevant and misleading variables.
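As a rough illustration of the unsupervised case, the sketch below drops one column from every highly correlated pair without ever looking at a target. The toy columns `a`, `b`, and `a_copy` and the 0.9 threshold are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "a_copy" is almost a rescaled duplicate of "a".
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["a_copy"] = df["a"] * 2.0 + rng.normal(scale=0.01, size=200)

# Absolute pairwise correlations between the input features only.
corr = df.corr().abs()

# Keep the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
threshold = 0.9  # assumed cut-off, tune it for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```

No target variable appears anywhere in this snippet, which is what makes the approach unsupervised.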

## Supervised Feature Selection Methods

Supervised feature selection methods are further classified into three categories.

- Wrapper method,
- Filter method,
- Intrinsic method

### Wrapper Feature Selection Methods

The wrapper methods create several models, each trained on a different subset of the input features. The subset that results in the best-performing model, according to a chosen performance metric, is then selected.

The wrapper methods are unconcerned with the variable types, though they can be computationally expensive.

A well-known example of a wrapper feature selection method is **Recursive Feature Elimination** (RFE).

RFE repeatedly fits a model, ranks the predictor variables by importance, and removes the weakest ones until the desired number of features remains.
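Here is a minimal sketch of RFE with scikit-learn. The synthetic dataset, the logistic regression estimator, and the choice to keep four features are all assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 10 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# RFE fits the estimator, drops the weakest feature (step=1),
# and repeats until only n_features_to_select features remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4, step=1)
rfe.fit(X, y)

X_selected = rfe.transform(X)
print("Selected mask:", rfe.support_)
print("Ranking (1 = kept):", rfe.ranking_)
print("Reduced shape:", X_selected.shape)
```

Because a model is refitted at every elimination step, the cost grows with the number of features, which is the computational expense mentioned above.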

### Filter Feature Selection Methods

The filter feature selection methods use statistical techniques to evaluate the relationship between each independent input variable and the output (target) variable, and assign a **score** to each feature.

The scores are then used to **filter out** the input variables/features that we will use in our model.

The filter methods evaluate the significance of the feature variables only based on their inherent characteristics without the incorporation of any learning algorithm.

These methods are computationally inexpensive and faster than the wrapper methods.

The filter methods may provide worse results than wrapper methods if the data is insufficient to model the statistical correlation between the feature variables.

Unlike wrapper methods, the filter methods are less **prone to overfitting**, and they are used extensively on high-dimensional data.

The wrapper methods, by contrast, have a prohibitive computational cost on such data.

### Embedded or Intrinsic Feature Selection Methods

The machine learning models that have feature selection naturally incorporated as part of learning the model are termed as embedded or intrinsic feature selection methods.

Built-in feature selection is incorporated in some of the models, which means that the model includes the predictors that help in maximizing accuracy.

In this scenario, the machine learning model chooses the best representation of the data.

Examples of algorithms that use embedded methods are penalized regression models such as Lasso, and rule-based models such as decision trees.

Some of these machine learning models are naturally resistant to non-informative predictors.

Lasso can shrink the coefficients of uninformative features to exactly zero, and decision trees only split on the features that improve the fit, so both intrinsically conduct feature selection.
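To see the embedded idea in action, here is a small sketch that lets a Lasso model pick the features while it is being trained. The synthetic data and the penalty strength `alpha=1.0` are assumptions chosen only to make the effect visible.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumed): 20 features, only 5 carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Put all features on the same scale so the L1 penalty treats them evenly.
X_scaled = StandardScaler().fit_transform(X)

# SelectFromModel fits the Lasso and keeps the features whose
# coefficients were not shrunk to (almost) zero.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X_scaled, y)
X_selected = selector.transform(X_scaled)

n_selected = int(selector.get_support().sum())
print("Features kept by Lasso:", n_selected, "out of", X.shape[1])
print("Reduced shape:", X_selected.shape)
```

A larger `alpha` pushes more coefficients to zero and keeps fewer features, so in practice this value is usually tuned with cross-validation.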

Feature selection is related to dimensionality reduction, but both are different from each other. Both methods seek to reduce the number of variables or features in the dataset, but still, there is a subtle difference between them.

Let’s learn the difference in detail.

- **Feature selection** simply includes or excludes the given features without changing them.
- **Dimensionality reduction** transforms the features into a lower dimension. It reduces the number of attributes by creating new combinations of attributes.

The examples of dimensionality reduction methods are

- Principal Component Analysis,
- Singular Value Decomposition.
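The difference can be sketched in a few lines. Below, feature selection returns three of the original columns untouched, while PCA returns three new columns that are combinations of all the original ones. The dataset and the number three are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumed): 8 input features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# Feature selection: keeps 3 of the original columns exactly as they are.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_sel = selector.transform(X)
kept = selector.get_support(indices=True)
print("Kept original columns:", kept)
print("Values unchanged:", np.array_equal(X_sel, X[:, kept]))

# Dimensionality reduction: builds 3 new columns from all 8 features.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print("Component weights shape:", pca.components_.shape)
```

Each row of `pca.components_` holds one weight per original feature, so every new column depends on all eight inputs.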

## Feature Selection with Statistical Measures

We can use correlation type statistical measures between input and output variables, which can then be used as the basis for filter feature selection.

The choice of statistical measures highly depends upon the variable data types.

Common variable data types include:

- Numerical such as height
- Categorical such as a label

Both of the variable data types are subdivided into many categories, which are as under:

Numerical variables are divided into the following:

- Integer Variables
- Float Variables

On the other hand, categorical variables are divided into the following:

- Boolean Variables
- Nominal Variables
- Ordinal Variables

We will be considering these **categories of variables**, i.e., numerical and categorical, for both the input and the output.

The variables that are provided as input to the model are termed input variables. In feature selection, the input variables are those whose number we wish to reduce.

On the contrary, output variables are those that the model is trained to predict. They are also termed response variables.

Response variables generally indicate the type of predictive modeling problem being performed. For example:

- The numerical output variable depicts a regression predictive modeling problem.
- The categorical output variable depicts a classification predictive modeling problem.

### Univariate Feature Selection

In filter-based feature selection, the statistical measures are calculated considering only a single input variable at a time with the target (output) variable.

These statistical measures are termed as univariate statistical measures, which means that the interaction between input variables is not considered in the filtering process.

Univariate feature selection selects the best features on the basis of univariate statistical tests. We compare each feature to the target variable in order to determine the significant statistical relationship between them.

**Analysis of variance** (ANOVA) is one of the most widely used univariate tests. The majority of filter techniques are univariate, meaning that they evaluate each predictor in isolation.

The existence of the correlated predictors increases the possibility of selecting significant but redundant predictors. Consequently, a large number of predictors are chosen, which results in the rise of collinearity problems.

In univariate feature selection methods, we examine each feature individually to determine the features’ relationship with the response variable.

The following methods use various techniques to evaluate the input-output relation.

- Numerical Input & Numerical Output
- Numerical Input & Categorical Output
- Categorical Input & Numerical Output
- Categorical Input & Categorical Output

Let's discuss each of these in detail.

#### Numerical Input & Numerical Output

It is a type of regression predictive modeling problem having numerical input variables.

Common techniques include using a correlation coefficient, such as:

- Pearson’s for a linear correlation
- Rank-based methods for a nonlinear correlation.
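The two kinds of coefficient can be compared on a toy example. In this sketch the target grows exponentially with the feature, so the relationship is monotonic but far from linear. The data is made up for illustration, and Spearman’s coefficient is used as one example of a rank-based method.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy data (assumed): y is a monotonic but strongly nonlinear function of x.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=200)
y = np.exp(x)
noise = rng.normal(size=200)  # a feature unrelated to y

# Pearson measures linear association, Spearman works on the ranks.
pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
noise_r, _ = spearmanr(noise, y)

print(f"Pearson  (x, y): {pearson_r:.3f}")
print(f"Spearman (x, y): {spearman_r:.3f}")
print(f"Spearman (noise, y): {noise_r:.3f}")
```

The rank-based score is higher than Pearson’s here because it only depends on the ordering of the values, not on the shape of the curve.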

#### Numerical Input & Categorical Output

It is considered to be a classification predictive modeling problem having numerical input variables. It is the most common example of a classification problem.

Again, the common techniques are correlation-based, although here we take the categorical target into account.

The techniques are as follows:

- Analysis of variance (**ANOVA**) for a linear correlation.
- **Kendall’s rank coefficient** for a nonlinear correlation, assuming that the categorical variable is ordinal.
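A minimal sketch of both measures is given below, using the `f_classif` function from scikit-learn for the ANOVA F-test and `kendalltau` from SciPy. The synthetic dataset is an assumption; with `shuffle=False` its first two columns are the informative ones and the rest are noise.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic data (assumed): columns 0 and 1 are informative, 2 to 5 are noise.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, shuffle=False, random_state=0,
)

# ANOVA F-test: one score and one p-value per numerical input feature.
f_scores, p_values = f_classif(X, y)
for i, (f, p) in enumerate(zip(f_scores, p_values)):
    print(f"feature {i}: F = {f:.2f}, p = {p:.4f}")

# Kendall's tau between one feature and the target,
# treating the two class labels as ordered.
tau, tau_p = kendalltau(X[:, 0], y)
print(f"Kendall tau for feature 0: {tau:.3f}")

best = int(np.argmax(f_scores))
print("Highest F-score:", best)
```

A high F-score means the feature’s mean differs clearly between the classes, which is why the noise columns score low.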

#### Categorical Input & Numerical Output

It is a less common example of a regression predictive modeling problem, one having categorical input variables.

We can use the same “Numerical Input, Categorical Output” methods as discussed above but in **reverse**.

#### Categorical Input & Categorical Output

It is considered as a classification predictive modeling problem having categorical input variables.

The following techniques are used in this predictive modeling problem.

- Chi-Squared test
- Mutual Information

The chi-squared test is the most common correlation measure for categorical data. It tests whether there is a significant difference between the observed and the expected frequencies of two categorical variables.

The null hypothesis is that there is no association between the two variables.

For applying the chi-squared test to determine the relationship between various features in the dataset and the target variable, the following conditions must be met:

- The variables under consideration must be categorical.
- The variables must be sampled independently.
- The values must have an expected frequency greater than 5.
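The sketch below runs the chi-squared test and computes mutual information for two categorical features against a categorical target. The columns `related` and `unrelated` are made-up toy data: the first agrees with the target about 80% of the time, the second is pure noise.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import mutual_info_score

# Toy categorical data (assumed for illustration).
rng = np.random.default_rng(0)
n = 400
target = rng.integers(0, 2, size=n)
related = np.where(rng.random(n) < 0.8, target, 1 - target)
unrelated = rng.integers(0, 3, size=n)
df = pd.DataFrame({"related": related, "unrelated": unrelated, "target": target})

results = {}
for col in ["related", "unrelated"]:
    # Observed frequencies of each (feature value, target value) pair.
    table = pd.crosstab(df[col], df["target"])
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    mi = mutual_info_score(df[col], df["target"])
    results[col] = {
        "chi2": chi2_stat,
        "p_value": p_value,
        "min_expected": expected.min(),  # check the "greater than 5" condition
        "mutual_info": mi,
    }
    print(col, results[col])
```

A small p-value lets us reject the null hypothesis of no association, so the `related` column would be kept and the `unrelated` one would be a candidate for removal.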

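To summarize the above concepts:

- Numerical input, numerical output: Pearson’s correlation (linear) or rank-based methods (nonlinear).
- Numerical input, categorical output: ANOVA (linear) or Kendall’s rank coefficient (nonlinear, ordinal target).
- Categorical input, numerical output: the same measures as the previous case, applied in reverse.
- Categorical input, categorical output: chi-squared test or mutual information.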

## Feature Selection Strategies

While building a **machine learning model** in real life, it is uncommon for all the variables in the dataset to be useful for building the model.

The overall accuracy and the generalization capability of the model are reduced by the addition of redundant variables. Furthermore, the complexity of the model is also increased by adding more and more variables.

In this section, some additional considerations using filter-based feature selection are mentioned, which are:

- Selection Method
- Transform Variables

### Selection Method

The scikit-learn library provides a variety of methods for filtering features once the statistics have been calculated for each input (independent) variable against the target (dependent) variable.

The most commonly used methods are:

- Selection of the top k variables, i.e., SelectKBest is the sklearn feature selection method used here.
- Selection of the top percentile of variables, i.e., SelectPercentile is the sklearn feature selection method used for this purpose.
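Both selectors can be tried side by side. In this sketch the ANOVA F-test (`f_classif`) is the scoring function, and the synthetic dataset, `k=5`, and `percentile=25` are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data (assumed): 20 numerical features, categorical target.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Keep the 5 features with the highest scores.
top_k = SelectKBest(score_func=f_classif, k=5)
X_k = top_k.fit_transform(X, y)

# Keep the best-scoring 25% of the features.
top_pct = SelectPercentile(score_func=f_classif, percentile=25)
X_pct = top_pct.fit_transform(X, y)

print("SelectKBest shape:", X_k.shape)
print("SelectPercentile shape:", X_pct.shape)
print("Columns kept by SelectKBest:", top_k.get_support(indices=True))
```

With 20 features, the top 25% is also five features, so both selectors end up with the same number of columns here.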

### Transform Variables

Variables can be transformed from one type to another in order to access different statistical measures.

For example, we can transform a categorical variable into an ordinal variable. Also, we can transform a numerical value into a discrete one, etc., and see the interesting results coming out.

So, we can transform the data to meet the test requirements so that we can try and compare the results.
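Here is a small sketch of the two transformations mentioned above, using scikit-learn’s `OrdinalEncoder` and `KBinsDiscretizer`. The education levels, their ordering, and the age values are hypothetical examples.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder

# Categorical -> ordinal: the category order is an assumption we supply.
education = np.array(
    [["school"], ["bachelor"], ["master"], ["bachelor"], ["school"]]
)
encoder = OrdinalEncoder(categories=[["school", "bachelor", "master"]])
education_ord = encoder.fit_transform(education)
print(education_ord.ravel())

# Numerical -> discrete: split the age range into 3 equal-width bins.
age = np.array([[22.0], [35.0], [47.0], [58.0], [63.0], [29.0]])
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_binned = binner.fit_transform(age)
print(age_binned.ravel())
```

After binning, a numerical feature such as age can be scored with tests meant for categorical data, such as the chi-squared test.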

## Which Feature Selection Method is the Best?

None of the feature selection methods can be regarded as the best method. Even speaking on a universal scale, there is no best machine learning algorithm or the best set of input variables.

Instead, we need to discover which feature selection will work best for our specific problem using careful, systematic experimentation.

So, we try a range of models on different subsets of features chosen using various statistical measures and then discover what works best for our concerned problem.
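One way to run such an experiment is to put the selector and the model in a pipeline and compare the cross-validated scores. The sketch below is only a starting point: the dataset, the logistic regression model, the two scoring functions, and the values of `k` are all assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumed for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Candidate statistical measures; the seed keeps mutual information repeatable.
score_funcs = {
    "anova": f_classif,
    "mutual_info": partial(mutual_info_classif, random_state=0),
}

results = {}
for name, score_func in score_funcs.items():
    for k in (3, 5, 10):
        # The selector sits inside the pipeline, so it is refitted on each
        # training fold and never sees the validation fold.
        pipe = Pipeline([
            ("select", SelectKBest(score_func=score_func, k=k)),
            ("model", LogisticRegression(max_iter=1000)),
        ])
        scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
        results[(name, k)] = scores.mean()

best = max(results, key=results.get)
print("Mean accuracy per setting:", results)
print("Best setting:", best)
```

Keeping the selection step inside the pipeline matters, because selecting features on the full dataset first would leak information from the validation folds into the scores.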

## Feature Selection Implementations

The following section depicts the worked examples of feature selection cases for a regression problem and a classification problem.

### Feature Selection For Regression models

The following code depicts the feature selection for the **regression problem** as numerical inputs and numerical outputs.

You can download the dataset from Kaggle; please download the training dataset. The following output is generated on running the above code:

We used the **chi-squared** statistical test, which requires non-negative feature values, and the SelectKBest class to select the top 10 features for our model from the Mobile Price Range Prediction dataset.

When we run the above example,

- A regression dataset is created.
- Feature selection is defined.
- Feature selection is applied to the regression dataset.
- We get a subset of selected input features.
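The four steps listed above can be sketched as follows. This is not the original code for the Kaggle dataset. It is a stand-in that uses a synthetic regression dataset and the `f_regression` scoring function, both of which are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Step 1: create a regression dataset (synthetic, assumed for illustration).
X, y = make_regression(
    n_samples=100, n_features=100, n_informative=10, random_state=1
)

# Step 2: define the feature selection (top 10 features by F-statistic).
fs = SelectKBest(score_func=f_regression, k=10)

# Step 3: apply the feature selection to the regression dataset.
X_selected = fs.fit_transform(X, y)

# Step 4: inspect the subset of selected input features.
print("Original shape:", X.shape)
print("Selected shape:", X_selected.shape)
print("Selected column indices:", fs.get_support(indices=True))
```

The `f_regression` function suits numerical inputs with a numerical output, which matches the case described in this section.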

### Classification Feature Selection

The following code depicts the feature selection for the classification problem as numerical inputs and categorical outputs.

The output of the above code is as:

We got the feature importance of each of our features using the feature importance property of the model. The feature importance depicts the importance of each feature by giving its score.

The higher the score of any feature, the more significant and relevant it is towards our response variable.

When we run the above example,

- A classification dataset is created.
- Feature selection is defined.
- Feature selection is applied to the classification dataset.
- We get a subset of selected input features.
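A feature importance ranking like the one described above can be sketched as below. The original model is not shown in this article, so the `ExtraTreesClassifier` and the synthetic dataset are assumptions. With `shuffle=False`, the first three columns are the informative ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (assumed): columns 0 to 2 are informative, the rest are noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Tree ensembles expose one importance score per input feature after fitting.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importances = model.feature_importances_

# Sort the features from the most to the least important.
ranking = importances.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: importance = {importances[idx]:.3f}")
```

The importance scores sum to one, so each value can be read as the feature’s share of the total importance.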

## What Next?

Don’t limit yourself to the above two code examples. Try experimenting with the other feature selection methods we explained.

To cross-check, build a **machine learning model** without applying any feature selection, then apply a feature selection method and compare the accuracy.

For **classification problems**, you can leverage the well-known **classification evaluation metrics**. For simple cases, you can measure the performance of the model with a **confusion matrix**.

For regression problems, you can check the **R-squared and Adjusted R-squared** measures.
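scikit-learn provides `r2_score`, while adjusted R-squared is easy to compute by hand. The helper `adjusted_r2` below is a hypothetical function written for this sketch, and the dataset and linear model are assumptions too.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def adjusted_r2(r2, n_samples, n_features):
    """Penalize R-squared for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Synthetic data (assumed): 30 features, only 5 of them informative.
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
adj = adjusted_r2(r2, n_samples=X_test.shape[0], n_features=X_test.shape[1])

print(f"R-squared: {r2:.3f}")
print(f"Adjusted R-squared: {adj:.3f}")
```

Because adjusted R-squared penalizes every extra feature, it is a handy measure for comparing models built on feature subsets of different sizes.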

## Conclusion

In this article, we explain the importance of feature selection methods while building machine learning models.

So far, we have learned how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Apart from this, we got an idea of the following:

- The types of feature selection techniques are supervised and unsupervised. The supervised methods are further classified into the **filter, wrapper, and intrinsic methods**.
- Statistical measures are used by filter-based feature selection to score the correlation or dependence between input variables and the output or response variable.
- Statistical measures for feature selection must be carefully chosen on the basis of the data type of the input variable and the output variable.
