Feature selection techniques with R

Working in the machine learning field is not only about building different classification or clustering models. It is just as much about feeding the right set of features into those models.

This process of selecting the right set of features mainly takes place after the data collection process.

Once we have enough data, we cannot simply feed all of it into the model and expect great results. We need to preprocess the data.

In fact, data preprocessing is the challenging and key part of the machine learning process.

Below are the key tasks we intend to perform in the data preprocessing stage.

  • Feature transformation
  • Feature selection

Feature transformation means transforming existing features into other forms, for example using a logarithmic function to convert skewed features into log-scaled features.
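As a small illustration of such a transformation, here is a minimal sketch on simulated data (the exponential sample and variable names are only illustrative and not part of this article’s dataset):

# A minimal sketch of feature transformation: log-transforming a right-skewed
# feature so that its distribution becomes closer to symmetric
x <- rexp(1000, rate = 0.5)        # an illustrative skewed feature
x_log <- log(x)                    # logarithmic transformation
par(mfrow = c(1, 2))
hist(x, main = "Original feature")
hist(x_log, main = "Log-transformed feature")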

Feature selection means selecting the best features out of the existing ones. In this article, we are going to learn basic techniques for picking the best features for modeling.

Before we dive further, let’s have a look at the table of contents.

Table of contents:

  • Why modeling is not the final step
  • The role of correlation
  • Calculating feature importance with regression methods
  • Using caret package to calculate feature importance
  • Random forest for calculating feature importance
  • Conclusion

Why Modeling is Not The Final Step

Like a coin, every project has two sides.

  • Business side
  • Technical side

The technical side deals with data collection, processing and then implementing it to get results. The business side is what envelops the technical side.

Business side and technical side (image credit: thecompanion.in)

The business side starts by defining the requirements, hands them over to the technical team to generate results, and then takes over again to convert those results into actionable insights. This is why both teams need to know what was implemented behind the scenes in the project.

The team handling the technical part may consider the models and the process as their core deliverables, but simply running the model and getting highly accurate results is never the end goal of the project for the business team.

It is the understanding of the project which makes it actionable. Thus, if you build a model but you don’t know what is happening around it, then it is a black box, which may be fine for lab results but not something that can be put into production.

While one may not be concerned with each and every detail of what is happening, one is definitely interested in what actionable insights can be derived from the model. Using variable importance can help achieve this objective.

Most models have a method to generate variable importance which indicates what features are used in the model and how important they are. Variable importance also has a use in the feature selection process.

As the Occam’s Razor principle states:

The simplest models are the best.

Ranking the features by decreasing variable importance helps one identify and select the features that produce most of the results, say 80%, and discard the remaining variables that account for only the last 20% of the accuracy.

Looking at variables (features) one by one can also help in understanding which features are important and how they contribute towards solving the business problem.

Deriving variable importance is not difficult and depends on the methodology being followed. This is why variable importance can be calculated in more than one way; it is not rocket science.

This article describes some such ways.

Role of Correlation

Correlation coefficient (image credit: http://slideplayer.com/slide/3941317/)

If you are working with a model which assumes a linear relationship between the features and the dependent variable, correlation can help you come up with an initial list of feature importance. It also works as a rough list for nonlinear models.

The idea is that those features which have a high correlation with the dependent variable are strong predictors when used in a model.

Let us generate a random dataset for this article.

# Use the clusterGeneration library to make a positive definite covariance matrix for 15 features

library(clusterGeneration)

S = genPositiveDefMat("unifcorrmat",dim=15)

# Create 15 features for 5000 data points using a multivariate normal distribution

library(mnormt)

n = 5000

X = rmnorm(n,varcov=S$Sigma)

Let us now create a dependent feature Y and compute a correlation table for these features.

# Create a two class dependent variable using binomial distribution

Y = rbinom(n,size=1,prob=0.3)
data = data.frame(Y,X)

# Create a correlation table for Y versus all features
cor(data,data$Y)

            [,1]
Y      1.000000000
X1  -0.013270223
X2  -0.002782848
X3  -0.005647999
X4  -0.018287654
X5  -0.017303147
X6   0.006512963
X7  -0.013494603
X8  -0.008466241
X9  -0.001837453
X10  0.015101810
X11  0.018945108
X12 -0.005708211
X13 -0.009837814
X14 -0.008292952
X15 -0.009675556

As expected, since we are using a randomly generated dataset, there is little correlation between Y and the other features. These numbers may be different for different runs.

In this case, the correlation for X11 seems to be the highest. Had we used this data for modeling, X11 would be expected to have the maximum impact on predicting Y. In this way, the list of correlations with the dependent variable gives an idea of the features that impact the outcome.

When computing correlations, we assume that the features and the dependent variable are numeric. If we instead look at Y as a class, we can also examine the distribution of each feature for every class of Y, as in the sketch below.
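Here is a quick sketch using the random data generated above; X1 is chosen only for illustration, and any of the other features could be plotted the same way:

# Treat Y as a class label and look at how one feature (X1) is distributed
# within each class
boxplot(X1 ~ Y, data = data, main = "Distribution of X1 by class of Y")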

Using Regression to Calculate Variable Importance

The summary function of a regression model also describes the features and how they affect the dependent variable through their significance, and it marks all the features that are statistically significant.

Such features usually have a p-value of less than 0.05, which means they are significant at more than the 95% confidence level.

Let us look at an example:

# Using the mlbench library to load diabetes data

library(mlbench)
data(PimaIndiansDiabetes)
data_lm = as.data.frame(PimaIndiansDiabetes)

# Fit a logistic regression model
fit_glm = glm(diabetes~.,data_lm,family = "binomial")

# generate summary

summary(fit_glm)
Call:
glm(formula = diabetes ~ ., family = "binomial", data = data_lm)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5566  -0.7274  -0.4159   0.7267   2.9297

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
pregnant     0.1231823  0.0320776   3.840 0.000123 ***
glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
pressure    -0.0132955  0.0052336  -2.540 0.011072 *
triceps      0.0006190  0.0068994   0.090 0.928515
insulin     -0.0011917  0.0009012  -1.322 0.186065
mass         0.0897010  0.0150876   5.945 2.76e-09 ***
pedigree     0.9451797  0.2991475   3.160 0.001580 **
age          0.0148690  0.0093348   1.593 0.111192
---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 723.45  on 759  degrees of freedom
AIC: 741.45

Number of Fisher Scoring iterations: 5

The output of the logistic model gives us the estimates and the probability values for each of the features. It also marks the important features with stars based on their p-values.

For features whose class is a factor, the model breaks the feature into one coefficient per factor level. We see that the most important variables include the glucose, mass and pregnant features for diabetes prediction. In this manner, regression models provide us with a list of important features. A short sketch for extracting the significant features programmatically follows.
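As a minimal sketch (not part of the original output), the significant features can be pulled out of the summary directly; the 0.05 threshold is the one discussed above:

# Extract the features whose p-value in the logistic regression summary is below 0.05
coefs <- coef(summary(fit_glm))
significant <- rownames(coefs)[-1][coefs[-1, "Pr(>|z|)"] < 0.05]
significant
# For the summary above: pregnant, glucose, pressure, mass, pedigree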

Using the Caret Package to Calculate Variable Importance

R’s caret package includes the varImp() function, which calculates feature importance for almost all models.

Let’s compare our previous model summary with the output of the varImp() function.

# Using varImp() function

library(caret)
varImp(fit_glm)

            Overall
pregnant          3.8401403
glucose           9.4813935
pressure          2.5404160
triceps           0.0897131
insulin           1.3223094
mass              5.9453340
pedigree          3.1595780
age               1.5928584

The varImp() output ranks glucose as the most important feature, followed by mass and pregnant. This mirrors the significance (the z statistics and hence the p-values) of the logistic regression model; a quick check is sketched below.
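For a glm, the scores reported by varImp() should correspond to the absolute z statistics from the model summary, so a quick side-by-side comparison (a sketch, not part of the original article) is:

# Compare varImp() scores with the absolute z values from the glm summary
cbind(varImp(fit_glm), abs_z = abs(coef(summary(fit_glm))[-1, "z value"]))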

However, the varImp() function also works with other models, such as random forests, and gives an idea of the relative importance through the scores it generates.

Variable Importance Through Random Forest

Random forests are based on decision trees and use bagging to come up with a model over the data. Random forests also have a feature importance methodology, which uses the Gini index to assign a score and rank the features.

Let us see an example and compare it with varImp() function.

# Import the random forest library and fit a model

library(randomForest)
fit_rf = randomForest(diabetes~., data=data_lm)
# Create an importance based on mean decreasing gini
importance(fit_rf)

                          MeanDecreaseGini
pregnant                      29.11588
glucose                       91.17223
pressure                      30.88188
triceps                       23.91996
insulin                       24.79802
mass                          56.83389
pedigree                      42.83993
age                           47.12770


# compare the feature importance with varImp() function

varImp(fit_rf)

            Overall
pregnant          29.11588
glucose           91.17223
pressure          30.88188
triceps           23.91996
insulin           24.79802
mass              56.83389
pedigree          42.83993
age               47.12770

We see that the importance scores given by the varImp() function and by the importance() function of randomForest are exactly the same. If the model being used is a random forest, we also have a function known as varImpPlot() to plot these scores.

# Create a plot of importance scores by random forest
varImpPlot(fit_rf)

These scores, which are reported as ‘MeanDecreaseGini’ by the importance measure, represent how much each feature contributes to the homogeneity of the data. The way it works is as follows:

Each time a feature is used to split the data at a node, the Gini index is calculated at the splitting node and at both of the resulting child nodes. The Gini index measures impurity: it is 0 for completely homogeneous data and approaches 1 for completely heterogeneous data.

The decrease from the Gini index of the splitting node to the weighted Gini index of its child nodes is calculated and credited to the feature used for the split.

Such splits are said to increase the ‘purity’ of the data, which means the data becomes easier to classify. The greater the gain in purity, the larger the decrease in the Gini index.

Hence, the mean decrease in Gini index is highest for the most important feature.

Such features are useful in classifying the data and are likely to split the data into pure single class nodes when used at a node. Hence they are used first during splitting.

The overall mean decrease in Gini for each feature is then obtained by summing these decreases over all the splits on that feature across all trees in the forest and averaging over the number of trees. A toy illustration of the impurity decrease at a single split follows.
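The following is a toy sketch (not taken from the randomForest implementation) showing the Gini impurity of a node and the decrease produced by one perfect split:

# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

parent <- c(1, 1, 1, 0, 0, 0)   # mixed node: Gini = 0.5
left   <- c(1, 1, 1)            # pure child: Gini = 0
right  <- c(0, 0, 0)            # pure child: Gini = 0

# weighted decrease in Gini credited to the splitting feature
gini(parent) -
  (length(left) / length(parent)) * gini(left) -
  (length(right) / length(parent)) * gini(right)
# 0.5 for a perfect split of a balanced two-class node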

This method is very useful to get importance scores and go a step further towards model interpretation.

Conclusion

Variable importance is usually followed by variable selection. Whether feature importance is generated before fitting the model (by methods such as correlation scores) or after fitting it (by methods such as varImp() or Gini importance), the scores give insight not only into the features with high weight that the model uses frequently, but also into the features that are slowing our model down.

This is why feature selection is used: it can improve the performance of the model by removing predictors with a chance or negative influence, and it provides faster and more cost-effective implementations by decreasing the number of features going into the model.

To decide on the number of features to choose, one should come up with a number such that neither too few nor too many features are being used in the model.

For a methodology such as correlation, features whose correlation with the target is not significant and merely due to chance (say, within the range of +/- 0.1 for a particular problem) can be removed, as in the sketch below.
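A minimal sketch with the randomly generated data from earlier; the +/- 0.1 cutoff is only the illustrative threshold mentioned above:

# Drop features whose absolute correlation with Y is below the cutoff
correlations <- cor(data, data$Y)[, 1]
weak <- setdiff(names(correlations)[abs(correlations) < 0.1], "Y")
data_reduced <- data[, !(names(data) %in% weak)]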

For other methods, such as the scores produced by the varImp() function or the importance() function of random forests, one should keep features up to the point where there is a sharp decline in the importance scores.

In case of a large number of features (say hundreds or thousands), a simpler approach can be a cutoff, such as keeping only the top 20 or top 25 features, or keeping the features whose combined importance score crosses a threshold of 80% or 90% of the total importance score. A sketch of such a cumulative cutoff follows.
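As a minimal sketch of the cumulative-importance version (the 80% threshold is only illustrative):

# Keep the top-ranked features until they account for 80% of the total
# mean decrease in Gini reported by the random forest
imp <- sort(importance(fit_rf)[, "MeanDecreaseGini"], decreasing = TRUE)
cum_share <- cumsum(imp) / sum(imp)
selected <- names(imp)[seq_len(which(cum_share >= 0.8)[1])]
selected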

In the end, variable selection is a trade-off: the reduction in model complexity and the gain in execution speed must be weighed against any loss of information, at a balance the project owners are comfortable with.

The methods mentioned in this article are meant to provide an overview of the ways in which variable importance can be calculated for a dataset. There are other, similar variable importance methods with their own uses and implementations, to be applied as the situation demands.

Complete Code

# Use the library cluster generation to make a positive definite matrix of 15 features

library(clusterGeneration)
S = genPositiveDefMat("unifcorrmat",dim=15)
#create 15 features using multivariate normal distribution for 5000 datapoints
library(mnormt)

n = 5000
X = rmnorm(n,varcov=S$Sigma)

# Create a two class dependent variable using binomial distribution
Y = rbinom(n,size=1,prob=0.3)

data = data.frame(Y,X)
# Create a correlation table for Y versus all features

cor(data,data$Y)
# Using the mlbench library to load diabetes data
library(mlbench)
data(PimaIndiansDiabetes)
data_lm=as.data.frame(PimaIndiansDiabetes)
# Fit a logistic regression model
fit_glm=glm(diabetes~.,data_lm,family = "binomial")

# generate summary
summary(fit_glm)
# Using varImp() function
library(caret)
varImp(fit_glm)

#Import the random forest library and fit a model
library(randomForest)
fit_rf=randomForest(diabetes~., data=data_lm)
# Create an importance based on mean decreasing gini
importance(fit_rf)

# compare the feature importance with varImp() function
varImp(fit_rf)

# Create a plot of importance scores by random forest
varImpPlot(fit_rf)


I hope you liked this post. If you have any questions, feel free to comment below. If you want me to write on a particular topic, let me know in the comments as well.


Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.
