# Feature Selection Techniques in Machine Learning [2023 Edition]

Feature selection techniques are a key factor in building accurate **machine learning models**. For any given dataset, a machine learning model learns the **mapping between** the input features and the target variable.

So, for new data where the target is unknown, the model can predict the target variable.

In machine learning, many factors affect the **performance of a model**, and they include:

- Algorithm choice,
- The features used to train the model,
- **Parameters used in the algorithm**,
- Quality of the dataset

Occasionally, the features of a dataset in their raw form do not provide the best information for training the model and making predictions.

Therefore, it is beneficial to discard the conflicting and unnecessary features from our dataset by using **feature selection methods**, also called feature selection techniques.


In machine learning, we define a feature as:

An individual measurable property or a characteristic feature of a phenomenon under observation.

Each feature, or column, represents a measurable piece of data that can be used for analysis. Examples of feature variables are:

- Name,
- Age,
- Gender,
- Education qualification,
- Salary etc.

If you consider the above features for a machine learning model, **names** won’t add any significant information.

There are various techniques to **convert text data to numerical** form, but in this case the **name feature** is not helpful.

We can remove such features manually, but the nonsignificant features are not necessarily text data. They could be **numerical features** too.

How do we remove those features before going to the modeling phase?

This is where **feature selection** methods come in: they help identify the key features to build the model.

## What is Feature Selection?

Now, we define the feature selection process as follows:

“The method of reducing the number of input variables during the development of a predictive model.”

**OR**

“Feature selection is a process of automatic selection of a subset of relevant features or variables from a set of all features, used in the process of model building.”

Other names of feature selection are variable selection or attribute selection.

It is possible to select those characteristic variables or features in our data that are most useful for building accurate models.

So how can we filter out the best features out of all the available features?

To achieve that, we have various feature selection methods.

So, in this article, we will explore the feature selection methods that we can use to identify the best features for our machine learning model.

After reading this article, you will get to know about the following:

- Two main types of feature selection techniques are supervised and unsupervised, and the supervised methods are further classified into the wrapper, filter, and intrinsic methods.
- Filter-based feature selection methods use statistical techniques to score the dependence or correlation between the input variables and the target variable, and these scores are then used to choose the most relevant features.
- Statistical measures must be carefully chosen for feature selection on the basis of the data type of the input variable and the response (output) variable.


## Why Are Feature Selection Techniques Important?

Feature selection techniques are key concepts in machine learning that strongly affect the **model’s performance**.

Irrelevant and misleading data features can **negatively impact** the performance of our machine learning model. That is why feature selection and data cleaning should be the first steps of model design.

These feature selection methods reduce the number of **input variables/features** to those that are considered to be useful in the prediction of the target.

So, the primary focus of feature selection is to:

“Remove non-informative or redundant predictors from our machine learning model.”

Some predictive modeling problems contain a large number of variables that require a large amount of system memory and therefore slow down the development and training of the models.

The importance of feature selection in building a machine learning model is:

- It **improves the accuracy** with which the model predicts the target variable of an unseen dataset.
- It **reduces** the computational cost of the model.
- It improves the **understandability** of the model by removing the unnecessary features so that it becomes more interpretable.

## Benefits of Feature Selection Techniques

Having irrelevant features in your data can **decrease the accuracy** of many models, especially **linear algorithms** like linear and **logistic regression**.

The benefits of performing feature selection before building the model are as follows:

- **Reduction in Model Overfitting:** Less redundant data implies less opportunity to make noise-based decisions.
- **Improvement in Accuracy:** Less misleading data implies improvement in modeling accuracy.
- **Reduction in Training Time:** Less data implies that algorithms train at a faster rate.

### Difference Between Supervised and Unsupervised methods

We can think of the feature selection methods in terms of **supervised and unsupervised** methods.

The methods that attempt to discover the relationship between the input variables (also called independent variables) and the target variable are termed supervised methods.

They aim to identify the features that are relevant for achieving a highly accurate model, and they rely on the availability of labeled data.

Examples of supervised learning algorithms are:

The methods that do not require any **labeled** data to relate the input variables to an output variable are termed **unsupervised methods**.

They find interesting activity patterns in **unlabelled data** and score all data dimensions based on various criteria such as variance, entropy, and ability to preserve local similarity, etc.

For example,

**Clustering** can be used for customer segmentation, to understand the different customer groups around which marketing and business strategies are built.

Unsupervised feature selection methods **don’t** consider the target variable, such as the methods that remove the redundant variables using correlation.

On the contrary, the supervised feature selection techniques make use of the target variable, such as the methods which remove the irrelevant and misleading variables.

## Supervised Feature Selection Techniques

Supervised feature selection methods are further classified into three categories.

- Wrapper methods,
- Filter methods,
- Intrinsic methods

Let's discuss these techniques in depth.

## Wrapper Feature Selection Methods

The wrapper methods create several models, each trained on a different subset of the input features. The subset that results in the best-performing model according to the chosen performance metric is then selected.

The wrapper methods are unconcerned with the variable types, though they can be computationally expensive.

A well-known example of a wrapper feature selection method is **Recursive Feature Elimination** (RFE).

Wrapper procedures such as RFE evaluate multiple models, adding or removing predictor variables to find the combination that maximizes the model’s performance.

In Summary,

Wrapper methods for feature selection involve the use of a specific machine learning algorithm to evaluate and compare the performance of different subsets of features and select the best-performing subset.

Below are the popular techniques that fall under the wrapper feature selection type. Each one is explained after the list.

- Recursive Feature Elimination (RFE)
- Sequential Feature Selector (SFS)
- Stepwise Regression
- Genetic Algorithms
- Cross-Validation Selection
- Shapley Value Regression
- Exhaustive Feature Selector

### 1. Recursive Feature Elimination (RFE):

Recursive Feature Elimination works by recursively removing features and building a model on those features that remain. It uses the fitted model (for example, its coefficients or feature importances) to identify which features contribute the most to predicting the target variable. In each iteration, the least important feature is removed until the specified number of features is reached.
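The elimination loop described above can be sketched with scikit-learn's `RFE` class. The synthetic dataset, the logistic regression estimator, and the choice of three features to keep are assumptions made only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed setup): 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the estimator and drop the weakest feature (step=1)
# until 3 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("Selected mask:", rfe.support_)  # True for the features that were kept
print("Ranking:", rfe.ranking_)        # 1 = kept, larger = eliminated earlier
X_reduced = rfe.transform(X)
```

Here `ranking_` records the order of elimination, which can be useful when you want to see which features were dropped first.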

### 2. Sequential Feature Selector (SFS):

Sequential Feature Selector is a search technique that either starts with no features and adds them one by one (forward selection) or starts with all the features and removes them one by one (backward elimination). At each step, it selects the feature that has the best score when added to or removed from the set of selected features.
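As a rough sketch of forward selection, scikit-learn provides `SequentialFeatureSelector`. The k-nearest-neighbors estimator, the synthetic data, and the target of three features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumed setup): 8 features, 3 informative.
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Forward selection: start from an empty set and add the feature that gives
# the best cross-validated score at each step, until 3 features are chosen.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=3,
    direction="forward",  # use "backward" for backward elimination
    cv=3,
)
sfs.fit(X, y)

selected = sfs.get_support(indices=True)
print("Selected feature indices:", selected)
X_reduced = sfs.transform(X)
```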

### 3. Stepwise Regression:

Stepwise Regression is a combination of both forward selection and backward elimination. It starts with no features and adds the best feature at each step, as in forward selection. After adding each new variable, it checks if some variables can be removed without a significant loss of fit, similar to backward elimination.

### 4. Genetic Algorithms:

Genetic Algorithms (GAs) are inspired by the process of natural selection and use methods such as mutation, crossover, and selection to iteratively select the best subset of features. It treats each subset as an individual in a population and evolves these individuals across generations to find the best-performing subset.

### 5. Cross-Validation Selection:

This method involves using cross-validation to assess the performance of feature subsets. The cross-validation results guide the selection process — subsets that perform better in terms of cross-validated performance metrics are preferred.
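A minimal sketch of this idea scores a few candidate subsets with `cross_val_score` and keeps the best one. The candidate subsets, their names, and the logistic regression model are hypothetical choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# With shuffle=False, make_classification places the informative columns first,
# so columns 0-2 carry signal and columns 3-5 are noise (assumed setup).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)

# Hypothetical candidate subsets, given as column indices.
candidates = {
    "first_three": [0, 1, 2],
    "last_three": [3, 4, 5],
    "all": [0, 1, 2, 3, 4, 5],
}

# Mean 5-fold cross-validated accuracy for each candidate subset.
scores = {
    name: cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()
    for name, cols in candidates.items()
}

best = max(scores, key=scores.get)
print(scores)
print("Preferred subset:", best)
```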

### 6. Shapley Value Regression:

Based on cooperative game theory, the Shapley value is a way to assign a payoff to each feature, considering all possible feature combinations. This method determines the contribution of each feature to the prediction model by considering the value it adds when included with different subsets of features.

### 7. Exhaustive Feature Selector:

An exhaustive search method evaluates all possible feature combinations to find the subset that produces the best model performance. Although this method can find the best subset, it is computationally very expensive and often impractical for datasets with a large number of features.
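To make the cost concrete, here is a small sketch that enumerates every non-empty subset with `itertools.combinations` and scores each one by cross-validation. It assumes a tiny synthetic dataset with 5 features, which already requires 31 model evaluations; the count doubles with every extra feature.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny synthetic dataset (assumed setup) so the search stays cheap.
X, y = make_classification(
    n_samples=150, n_features=5, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

n_features = X.shape[1]
best_score, best_subset = -1.0, None
n_evaluated = 0

# Try every subset size k and every combination of k columns.
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=3
        ).mean()
        n_evaluated += 1
        if score > best_score:
            best_score, best_subset = score, subset

print("Subsets evaluated:", n_evaluated)
print("Best subset:", best_subset, "score:", round(best_score, 3))
```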

Wrapper methods are generally more computationally intensive than filter methods because they involve the training of models for each feature subset considered. However, because they involve the use of a specific model, they can often find features that are better tailored to the performance of that particular model.

## Filter Feature Selection Methods

The filter feature selection methods use statistical techniques to evaluate the relationship between each independent input variable and the output (target) variable, and assign a **score** to each feature.

The scores are then used to **filter** the input variables and choose the features that we will use in our model.

The filter methods evaluate the significance of the feature variables only based on their inherent characteristics without the incorporation of any learning algorithm.

These methods are computationally inexpensive and faster than the wrapper methods.

The filter methods may provide worse results than wrapper methods if the data is insufficient to model the statistical correlation between the feature variables.

Unlike wrapper methods, the filter methods are less **prone to overfitting**. They are used extensively on high dimensional data.

The wrapper methods, by contrast, have a prohibitive computational cost on such data.

Below are the popular feature selection techniques which fall under Filter feature selection type.

- Mean Absolute Difference
- Chi-Square Test
- Information Gain
- Fisher's Score
- Correlation Coefficient
- Variance Threshold
- Dispersion Ratio

Here's a simple and clear explanation for each of the popular filter feature selection techniques:

### 1. Mean Absolute Difference (MAD):

The Mean Absolute Difference measures the extent to which the numerical features of a dataset vary from their average values. In essence, it calculates the average absolute deviation of each feature from its mean.

Features with a higher MAD are considered to have more variability and are often assumed to carry more information than those with lower MAD, making them good candidates for selection.
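The calculation is short enough to write directly in NumPy. The small matrix below is made-up data for illustration: the first column varies a lot, the second is constant, and the third barely changes.

```python
import numpy as np

# Made-up data: rows are samples, columns are features.
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 10.0, 5.0],
    [3.0, 10.0, 6.0],
    [4.0, 10.0, 5.0],
])

# Average absolute deviation of each feature from its own mean.
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
print(mad)  # [1.    0.    0.375]
```

The constant column scores 0, so a MAD-based filter would rank it last.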

### 2. Chi-Square Test:

The Chi-Square test is a statistical method used to determine if there is a significant association between categorical variables. It compares the observed distribution of categories to an expected distribution if there were no relationship.

In feature selection, it helps to identify features that have a strong relationship with the output variable. The higher the Chi-Square value, the more likely the feature is influential and should be selected.
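A small sketch using scikit-learn's `chi2` scorer follows, which expects non-negative feature values such as counts. The toy data is an assumption: the first feature has much larger counts in class 1, while the second has the same value for every sample.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy count data (assumed): feature 0 depends on the class, feature 1 does not.
X = np.array([
    [5, 1],
    [6, 1],
    [0, 1],
    [1, 1],
])
y = np.array([1, 1, 0, 0])

scores, p_values = chi2(X, y)
print("Chi-square scores:", scores)
print("p-values:", p_values)
```

The first feature receives the higher score, so it would be preferred when selecting features by chi-square.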

### 3. Information Gain:

Information Gain measures the reduction in entropy or surprise from transforming a dataset in some way. It is often used in the construction of decision trees. When used in feature selection, it evaluates each feature by determining how well it can separate the classes.

A feature with high information gain is more capable of splitting the data into pure classes, which is desirable for classification tasks.
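In scikit-learn, a closely related score is available as `mutual_info_classif`. The sketch below assumes two discrete toy features: one that reproduces the class label exactly and one that is independent of it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    y,                         # perfectly separates the two classes
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the class
])

# discrete_features=True tells the estimator to treat the columns as categorical.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(mi)
```

The first feature removes all uncertainty about a balanced binary target, so its score equals the entropy of the target (ln 2, about 0.693 nats), while the second scores 0.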

### 4. Fisher's Score:

Fisher's Score, or Fisher Criterion, ranks features by the ratio of the variance between classes to the variance within classes.

Features with a high Fisher's Score are considered to have strong discriminative power as they exhibit high variation between classes while maintaining low variation within each class.
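One common way to write this ratio, assumed here, weights each class by its size: the numerator sums the squared distances of the class means from the overall mean, and the denominator sums the class variances. The helper name `fisher_score` and the toy data are hypothetical.

```python
import numpy as np

def fisher_score(X, y):
    """Between-class scatter divided by within-class scatter, per feature."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    # Note: a feature that is constant within every class gives within == 0.
    return between / within

# Toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([[1.0, 5.0], [2.0, 1.0], [9.0, 5.0], [10.0, 1.0]])
y = np.array([0, 0, 1, 1])

scores = fisher_score(X, y)
print(scores)  # [64.  0.]
```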

### 5. Correlation Coefficient:

The Correlation Coefficient measures the linear relationship between two features. When it comes to feature selection, a correlation matrix can be used to identify and remove features that are highly correlated with each other.

This is because highly correlated features provide redundant information, and removing them can reduce overfitting without losing significant predictive power.
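A minimal pandas sketch of this pruning step is shown below. The column names, the synthetic data, and the 0.9 cutoff are assumptions for illustration; in practice the cutoff is a judgment call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=100),  # nearly a rescaled copy of "a"
    "b": rng.normal(size=100),                        # unrelated feature
})

# Absolute pairwise correlations; keep only the upper triangle so that each
# pair is inspected once and a feature is never compared with itself.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```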

### 6. Variance Threshold:

This technique involves setting a threshold for the variance of feature values and dropping features whose variance does not meet this threshold.

Since features with low variance are less likely to affect the target variable, this method is effective at removing features that are considered to be constant or quasi-constant.
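scikit-learn implements this filter as `VarianceThreshold`. In the sketch below, the toy matrix and the threshold of 0.2 are assumptions: the first column is constant, the third is nearly constant, and only the second varies enough to be kept.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 2 is quasi-constant.
X = np.array([
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 3.0, 0.0],
    [0.0, 4.0, 1.0],
])

selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print("Variances:", selector.variances_)
print("Kept mask:", selector.get_support())
```

Because the filter looks only at the inputs, it needs no target variable and works for unsupervised problems too.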

### 7. Dispersion Ratio:

Similar to variance thresholding, the Dispersion Ratio measures the spread or dispersion of a feature. A common formulation is the ratio of the arithmetic mean to the geometric mean of a positive-valued feature: the ratio equals 1 when all values are identical and grows as the values spread out.

A low Dispersion Ratio indicates that the feature does not vary much across the data points and may not contribute much information for the model's prediction, hence could be considered for removal.
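As an illustration, the sketch below assumes one common formulation of the dispersion ratio: the arithmetic mean divided by the geometric mean, which requires strictly positive feature values. The toy matrix is made up.

```python
import numpy as np

# Toy data with strictly positive values: column 0 is spread out, column 1 is constant.
X = np.array([
    [1.0, 4.0],
    [4.0, 4.0],
    [16.0, 4.0],
])

arithmetic_mean = X.mean(axis=0)
geometric_mean = np.exp(np.log(X).mean(axis=0))

# The ratio is 1 for a constant feature and grows with the spread.
dispersion_ratio = arithmetic_mean / geometric_mean
print(dispersion_ratio)  # [1.75 1.  ]
```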

Each of these methods serves the purpose of reducing the number of input variables to those that are most useful to the model, which can result in decreased computational cost and improved model performance.

## Embedded or Intrinsic Feature Selection Method

The machine learning models that have feature selection naturally incorporated as part of learning the model are termed as embedded or intrinsic feature selection methods.

Built-in feature selection is incorporated in some of the models, which means that the model includes the predictors that help in maximizing accuracy.

In this scenario, the machine learning model chooses the best representation of the data.

Examples of algorithms that use embedded methods are penalized regression models such as Lasso.

Some of these machine learning models are naturally resistant to non-informative predictors.

Penalized models such as Lasso and rule-based models such as decision trees intrinsically conduct feature selection.
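To see embedded selection at work, the sketch below fits a Lasso model inside scikit-learn's `SelectFromModel`, which keeps the features whose coefficients were not shrunk to zero. The synthetic regression data and `alpha=1.0` are assumptions; a different penalty strength would keep a different number of features.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumed setup): 10 features, 3 of which drive the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Fitting the model and selecting features happen in one step.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
X_selected = selector.transform(X)

kept = selector.get_support(indices=True)
coef = selector.estimator_.coef_
print("Kept feature indices:", kept)
print("Lasso coefficients:", coef.round(2))
```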

Feature selection is related to dimensionality reduction, but both are different from each other. Both methods seek to reduce the number of variables or features in the dataset, but still, there is a subtle difference between them.

Let’s learn the difference in details.

- **Feature selection** simply includes or excludes the given features in the data without changing them.
- **Dimensionality reduction** transforms the features into a lower dimension. It reduces the number of attributes by creating new combinations of attributes.
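The difference can be checked in a few lines. In this sketch, `SelectKBest` stands in for feature selection and PCA for dimensionality reduction; both choices, and the synthetic data, are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(
    n_samples=100, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Feature selection: the output columns are original columns, unchanged.
selector = SelectKBest(f_classif, k=2).fit(X, y)
X_sel = selector.transform(X)
idx = selector.get_support(indices=True)
print("Selected columns are originals:", np.array_equal(X_sel, X[:, idx]))

# Dimensionality reduction: the output columns are new combinations of all inputs.
X_pca = PCA(n_components=2).fit_transform(X)
print("PCA output shape:", X_pca.shape)
```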

Examples of dimensionality reduction methods include Principal Component Analysis (PCA).

## Feature Selection with Statistical Measures

We can use correlation type statistical measures between input and output variables, which can then be used as the basis for filter feature selection.

The choice of statistical measures highly depends upon the variable data types.

Common variable data types include:

- Numerical such as height
- Categorical such as a label

Both of the variable data types are subdivided into many categories, which are as under:

Numerical variables are divided into the following:

- Integer Variables
- Float Variables

On the other hand, categorical variables are divided into the following:

- Boolean Variables
- Nominal Variables
- Ordinal Variables

We will be considering the **categories of variables**, i.e., numerical and categorical, along with input and output.

The variables that are provided as input to the model are termed input variables. In feature selection, the input variables are the ones whose number we wish to reduce.

On the contrary, output variables are the ones that the model is intended to predict. They are also termed response variables.

Response variables generally indicate the type of predictive modeling problem being performed. For example:

- The numerical output variable depicts a regression predictive modeling problem.
- The categorical output variable depicts a classification predictive modeling problem.

### Univariate Feature Selection

In filter-based feature selection, the statistical measures are calculated considering only a single input variable at a time with the target (output) variable.

These statistical measures are termed as univariate statistical measures, which means that the interaction between input variables is not considered in the filtering process.

Univariate feature selection selects the best features on the basis of univariate **statistical tests**. We compare each feature to the target variable in order to determine the significant statistical relationship between them.

**Analysis of variance** (ANOVA) is a common example of a univariate statistical test. The majority of these techniques are univariate, which means that they evaluate each predictor in isolation.

The existence of the correlated predictors increases the possibility of selecting significant but redundant predictors. Consequently, a large number of predictors are chosen, which results in the rise of collinearity problems.

In univariate feature selection methods, we examine each feature individually to determine the features’ relationship with the response variable.

The following methods use various techniques to evaluate the input-output relation.

- Numerical Input & Numerical Output
- Numerical Input & Categorical Output
- Categorical Input & Numerical Output
- Categorical Input & Categorical Output

Let's discuss each of these in detail.

#### Numerical Input & Numerical Output

It is a type of regression predictive modeling problem having numerical input variables.

Common techniques include using a correlation coefficient, such as:

- Pearson’s for a linear correlation
- Rank-based methods for a nonlinear correlation.
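The difference between the two kinds of coefficient can be sketched with SciPy, using Spearman's coefficient as one example of a rank-based method. The data below is made up: one target is an exact linear function of the input and the other is nonlinear but always increasing.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 21, dtype=float)
y_linear = 3.0 * x + 2.0       # exact linear relationship
y_monotonic = np.exp(x / 4.0)  # nonlinear, but always increasing

r_lin, _ = pearsonr(x, y_linear)
r_mono, _ = pearsonr(x, y_monotonic)
rho_mono, _ = spearmanr(x, y_monotonic)

print("Pearson, linear target:     ", round(r_lin, 3))
print("Pearson, monotonic target:  ", round(r_mono, 3))
print("Spearman, monotonic target: ", round(rho_mono, 3))
```

Pearson's coefficient understates the strength of the curved relationship, while the rank-based coefficient treats it as a perfect monotonic association.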

#### Numerical Input & Categorical Output

It is considered to be a classification predictive modeling problem having numerical input variables. It is the most common example of a classification problem.

Again, the common techniques here are correlation-based, although they must take the categorical target into account.

The techniques are as under:

- **ANOVA** (analysis of variance) for a linear correlation
- **Kendall’s rank coefficient** for a nonlinear correlation, assuming that the categorical variable is ordinal.
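For the ANOVA case, scikit-learn exposes the F-test as `f_classif`. The sketch below assumes synthetic data in which the first feature's mean shifts with the class and the second feature is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)  # two classes, 100 samples each

X = np.column_stack([
    rng.normal(loc=3.0 * y, scale=1.0),  # mean is 0 for class 0 and 3 for class 1
    rng.normal(size=200),                # unrelated to the class
])

# One F-statistic and p-value per feature.
F, p = f_classif(X, y)
print("F-statistics:", F.round(2))
print("p-values:", p)
```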

#### Categorical Input & Numerical Output

It is a less common example of a regression predictive modeling problem, having categorical input variables.

We can use the same “Numerical Input, Categorical Output” methods as discussed above but in **reverse**.

#### Categorical Input & Categorical Output

It is considered as a classification predictive modeling problem having categorical input variables.

The following techniques are used in this predictive modeling problem.

- **Chi-Squared test**
- Mutual Information

The chi-squared test is the most common correlation measure for categorical data. It tests if there exists a significant difference between the observed and the expected frequencies of two categorical variables.

Under the **null hypothesis**, there is no association between the two variables.

For applying the chi-squared test to determine the relationship between various features in the dataset and the target variable, the following conditions must be met:

- The variables under consideration must be categorical.
- The variables must be sampled independently.
- The values must have an expected frequency greater than 5.
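The expected-frequency condition can be checked before trusting the test result. This sketch uses SciPy's `chi2_contingency` on a made-up 2x2 table of counts for one categorical feature against a categorical target.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows are feature categories, columns are target classes.
observed = np.array([
    [30, 10],
    [15, 25],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Expected counts under the null hypothesis of no association.
print("Expected frequencies:\n", expected)
all_expected_ok = bool((expected > 5).all())
print("All expected frequencies greater than 5:", all_expected_ok)
print("chi2 =", round(chi2_stat, 3), "p =", round(p_value, 4), "dof =", dof)
```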

Just to summarize the above concepts, we are providing you with an image that explains everything.

## Feature Selection Strategies

While building a **machine learning model** in real life, it is uncommon for all variables in the dataset to be useful for building the model.

The overall accuracy and the generalization capability of the model are reduced by the addition of redundant variables. Furthermore, the complexity of the model is also increased by adding more and more variables.

In this section, some additional considerations using filter-based feature selection are mentioned, which are:

- Selection Method
- Transform Variables

### Selection Method

The scikit-learn library provides several methods for filtering features once the statistics have been calculated for each input (independent) variable against the target (dependent) variable.

The most commonly used methods are:

- Selection of the top k variables, i.e., SelectKBest is the sklearn feature selection method used here.
- Selection of the top percentile of variables, i.e., SelectPercentile is the sklearn feature selection method used for this purpose.
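Both selectors take a scoring function and differ only in how the cutoff is expressed. In this sketch the ANOVA F-test (`f_classif`), the synthetic dataset, `k=5`, and `percentile=25` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data (assumed setup): 20 features, 4 informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

# Keep the 5 highest-scoring features.
top_k = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_top_k = top_k.transform(X)

# Keep the highest-scoring 25% of the features (5 of the 20 here).
top_pct = SelectPercentile(score_func=f_classif, percentile=25).fit(X, y)
X_top_pct = top_pct.transform(X)

print(X_top_k.shape, X_top_pct.shape)
```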

### Transform Variables

Variables can be transformed into one another in order to access different statistical measures.

For example, we can transform a categorical variable into an ordinal variable. Also, we can transform a numerical value into a discrete one, etc., and see the interesting results coming out.

So, we can transform the data to meet the test requirements so that we can try and compare the results.

## Which Feature Selection Technique is the Best?

None of the feature selection methods can be regarded as the best method. Even speaking on a universal scale, there is no best machine learning algorithm or the best set of input variables.

Instead, we need to discover which feature selection will work best for our specific problem using careful, systematic experimentation.

So, we try a range of models on different subsets of features chosen using various statistical measures and then discover what works best for our concerned problem.

## Feature Selection Techniques Implementations

The following section depicts the worked examples of feature selection cases for a regression problem and a classification problem.

### Feature Selection For Regression models

The following code depicts the feature selection for the **regression problem** as numerical inputs and numerical outputs.

You can download the dataset from this kaggle dataset. Please download the training dataset. The following output is generated on running the above code:

We used the **chi-squared** statistical test for non-negative integers, and by using the SelectKBest class, we selected the top 10 features for our model from Mobile Price Range Prediction Dataset.

When we run the above example,

- A regression dataset is created
- Feature selection is defined
- Feature selection applied to the regression dataset
- We get a subset of selected input features
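The four steps above can be sketched on synthetic data. This is a minimal illustration rather than the article's original listing: it assumes a generated regression dataset and the `f_regression` scoring function, which suits numerical inputs and a numerical output, in place of the Kaggle data and chi-squared test mentioned earlier.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 1. Create a regression dataset (synthetic, assumed setup).
X, y = make_regression(n_samples=100, n_features=100, n_informative=10, random_state=1)

# 2. Define feature selection: keep the 10 best features by F-statistic.
fs = SelectKBest(score_func=f_regression, k=10)

# 3. Apply feature selection to the dataset.
X_selected = fs.fit_transform(X, y)

# 4. Inspect the subset of selected input features.
print(X_selected.shape)  # (100, 10)
```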

### Classification Feature Selection

The following code depicts the feature selection for the classification problem as numerical inputs and categorical outputs.

The output of the above code is as:

We got the feature importance of each of our features using the feature importance property of the model. The feature importance depicts the importance of each feature by giving its score.

The higher the score of any feature, the more significant and relevant it is towards our response variable.

When we run the above example,

- A classification dataset is created.
- Feature selection is defined.
- Feature selection is applied to the classification dataset.
- We get a subset of selected input features.
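A comparable sketch for the classification case is given below. It assumes a synthetic dataset and an `ExtraTreesClassifier`, one of several tree ensembles in scikit-learn that expose a `feature_importances_` attribute; the original listing may have used a different model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic classification data (assumed setup): 10 features, 3 informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the scores are normalized to sum to 1.
importances = model.feature_importances_
top3 = np.argsort(importances)[::-1][:3]

print("Importances:", importances.round(3))
print("Top 3 feature indices:", top3)
X_selected = X[:, top3]
```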

## What Next?

Don’t limit yourself to the above two code examples. Try the other feature selection methods we explained.

Just to cross-check, build any **machine learning model** without applying any feature selection methods, then pick any feature selection method and try to check the accuracy.

For **classification problems**, you can leverage the famous **classification evaluation metrics**. For simple cases, you can measure the performance of the model with a **confusion matrix**.

For regression problems, you can check the **R-squared and Adjusted R-squared** measures.

## Conclusion

In this article, we explain the importance of feature selection methods while building machine learning models.

So far, we have learned how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Apart from this, we got an idea of the following:

- The types of feature selection techniques are supervised and unsupervised. The supervised methods are further classified into the **filter, wrapper, and intrinsic methods**.
- Statistical measures are used by filter-based feature selection to score the correlation or dependence between input variables and the output or response variable.
- Statistical measures for feature selection must be carefully chosen on the basis of the data type of the input variable and the output variable.

## Frequently Asked Questions (FAQs) On Feature Selection Techniques

#### 1. What is Feature Selection in Machine Learning?

Feature selection is the process of identifying and selecting a subset of input variables that are most relevant to the target variable to build efficient and effective predictive models.

#### 2. Why is Feature Selection Important?

It helps to reduce overfitting, improve accuracy, and reduce training time by eliminating irrelevant or redundant data.

#### 3. What is the Difference Between Feature Selection and Feature Extraction?

Feature selection involves selecting a subset of the original features, while feature extraction involves creating new features by combining the original ones (e.g., PCA).

#### 4. Can Feature Selection Improve Model Performance?

Yes, by removing irrelevant features, models can often make more accurate predictions and be more generalizable.

#### 5. What are Filter Methods for Feature Selection?

Filter methods apply a statistical measure to assign a scoring to each feature. Features are ranked by the score and either selected to be kept or removed from the dataset.

#### 6. Can You Name Some Filter Methods?

Common filter methods include mutual information, chi-squared test, ANOVA, and correlation coefficients.

#### 7. What are Wrapper Methods for Feature Selection?

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.

#### 8. Can You Give Examples of Wrapper Methods?

Examples include recursive feature elimination, forward feature selection, backward feature elimination, and exhaustive feature selection.

#### 9. What are Embedded Methods for Feature Selection?

Embedded methods perform feature selection as part of the model construction process. The most common example is regularization methods like LASSO that include feature selection as part of the algorithm's objective function.

#### 10. How Does LASSO Perform Feature Selection?

LASSO (Least Absolute Shrinkage and Selection Operator) performs feature selection by applying a penalty to the absolute size of the coefficients. Some coefficients can become exactly zero, which is equivalent to the feature being excluded from the model.

#### 11. What is Dimensionality Reduction in Feature Selection?

Dimensionality reduction is a form of feature extraction that reduces the number of input variables by creating new features that are a combination of the original features.

#### 12. Is Feature Selection Necessary for All Machine Learning Algorithms?

It depends on the algorithm. Some, like decision trees, have built-in feature selection, but many algorithms benefit from feature selection, especially when dealing with high-dimensional data.
