Best Ways To Handle Imbalanced Data In Machine Learning


When dealing with any classification problem, we might not always get the target classes in equal proportion. There will be situations where you get data that is very imbalanced, i.e., the classes are not equally represented. In the machine learning world, we call this the class imbalance problem.

Building models on balanced target data is easier than handling imbalanced data; classification algorithms find it easier to learn from properly balanced data.

But in the real world, the data is rarely that convenient. We need to handle unstructured data, and we need to handle imbalanced data.

So as a data scientist or analyst, you need to know how to deal with class imbalance.

In this article, we are going to explain how to deal with this situation. There are various techniques for handling imbalanced data; let's learn about them in detail, along with their implementation in Python.



What is class imbalance in machine learning?

In machine learning, class imbalance is an issue with the target class distribution: if the target classes are not distributed in a roughly equal ratio, we say the data has a class imbalance issue. We will explain below why this is a problem.

Examples of balanced and imbalanced datasets

Let me give examples of balanced and imbalanced datasets, which helps in understanding class imbalance.

Balanced datasets:

  • A random sampling of coin trials
  • Classifying images as cat or dog
  • Sentiment analysis of movie reviews

If you look at the above examples, for the balanced datasets the target class distribution is nearly equal.

For example, in the random coin trials, even though some researchers claim the probability of getting heads is slightly higher than tails, the distribution of heads and tails is still nearly equal. It is the same with the movie review case too.

Imbalanced datasets:

  • Email spam or ham dataset
  • Credit card fraud detection
  • Machine components failure detections
  • Network failure detections

But when it comes to an imbalanced dataset, the target distribution is not equal. For the email spam-or-ham case, the distribution is heavily skewed.

Just imagine how many emails we receive every day and how many of them are classified as spam. Google uses its email classifier to do that.

In general, out of every 10 emails we receive, one will go to the spam folder and the others to the inbox, so the ham-to-spam ratio is 9:1. In credit card fraud detection, the ratio is even more skewed, something like 9.5:0.5.

By now, we are clear about what imbalanced data is. Now, let's learn why we need to balance the data; in other words, why we need to handle imbalanced data.

Why do we have to balance the data?

The answer is quite simple: to make our predictions more accurate.

If we train on imbalanced data, the model becomes biased toward the dominant target class and tends to predict the predominant class regardless of the input.

Say in credit fraud detection, out of 100 credit applications only 5 fall into the fraud category. Any machine learning model will then be tempted to predict against the fraud class, i.e., to predict that every applicant is not a fraud.

It is understandable why the trained model favors the dominant class: machine learning models try to minimize the overall error during training, and since the minority class contributes very few examples, the model gains little by reducing errors on it and focuses instead on getting the majority class right.

So to handle these kinds of issues, we need to balance the data before building the models.

How to deal with imbalanced data

To deal with imbalanced data, we need to convert it into balanced data in a meaningful way, and then build the machine learning model on the balanced dataset.

In the later sections of this article, we will learn about different techniques to handle the imbalanced data.

Before that, we will build a machine learning model on the imbalanced data as-is. Later we will apply the different balancing techniques.

So let’s get started.

Model on Imbalance data

About Dataset

We are taking this dataset from Kaggle, and you can download it from this link.

The dataset contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

The file contains one message per line. Each line is composed of two columns: v1 holds the label (ham or spam), and v2 contains the raw text.

The main task is to build a prediction model that accurately classifies which texts are spam.

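A minimal sketch of loading the data (the file name spam.csv and the latin-1 encoding are assumptions based on the Kaggle CSV):

```python
import pandas as pd

# Load the Kaggle SMS spam dataset (file name is an assumption)
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()
```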

Let’s have a look at the loaded data fields.


We have the target variable v1, which contains the ham or spam label, and v2, which holds the actual SMS text. In addition, we have some unnecessary fields, which we remove with the code below.

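A sketch of the cleanup, assuming the three unnamed extra columns that the Kaggle CSV ships with:

```python
# Drop the unnecessary columns (column names are the Kaggle CSV defaults)
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
```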

We renamed the loaded data fields to

  • label
  • text
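A minimal way to do the renaming:

```python
# Rename v1/v2 to more readable column names
df = df.rename(columns={'v1': 'label', 'v2': 'text'})
```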

Data ratio

Using the seaborn countplot, let's visualize the ham-to-spam target ratio.

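A sketch of the plot code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the ham vs. spam class distribution
sns.countplot(x='label', data=df)
plt.show()
```

This shows roughly the following split: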
  • Ham messages : 87%
  • Spam messages : 13%

We can clearly see that the data is imbalanced. Before creating a model, we need to do some data preprocessing.

Data Preprocessing

When dealing with text data, we first need to preprocess the text and then convert it into vectors. The steps are listed below, followed by a code sketch.

  • Stemming removes the suffix from a word and reduces it to its root word. First, apply stemming to reduce each word of the text to its root.

  • Text generally comes mixed with special characters, numbers, etc., so we need to remove the unwanted parts. Use regular expressions to replace all the unnecessary characters with spaces.

  • Convert all the text into lowercase to avoid getting different vectors for the same word. E.g.: and, And ------> and

  • Remove stopwords: "stop words" are the most common words in a language, e.g., he, is, at. We need to filter them out.

  • Split each sentence into words.

  • Keep the words that are not stopwords.

  • Join them back into sentences.

  • Append the cleaned text into a list (corpus).

  • Now our text is ready; convert it into vectors using CountVectorizer.

  • Convert the target label into a numeric form.
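Here is a minimal sketch of that pipeline, assuming the label and text columns from above, NLTK for stemming and stopwords, and an arbitrary max_features value:

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for message in df['text']:
    # Keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', message).lower().split()
    # Drop stopwords and stem the remaining words
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

# Convert the cleaned text into count vectors
cv = CountVectorizer(max_features=2500)  # max_features is an arbitrary choice
x = cv.fit_transform(corpus).toarray()

# Convert the label into a binary target (spam = 1, ham = 0)
y = pd.get_dummies(df['label'], drop_first=True).values.ravel()
```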

Model Creation

First, we simply create the model with the unbalanced data; afterwards, we try the different balancing techniques.

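A sketch of a baseline; the classifier choice (multinomial naive Bayes, a common fit for count vectors) and the split parameters are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hold out 20% of the data for evaluation
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(x_train, y_train)
```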

Let us check the accuracy of the model.

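Something like:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```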

We got an accuracy of 0.98, but this figure is misleading: with roughly 87% ham messages, a model biased toward the majority class can score high accuracy while doing poorly on the minority spam class.

Now we will learn how to handle imbalance data with different imbalanced techniques in the next section of the article.

Techniques for handling imbalanced data

There are many ways to handle imbalanced data. In this article, we will learn about the techniques below, along with their code implementation.

  1. Oversampling
  2. Undersampling
  3. Ensemble Techniques

In this article, we will focus only on the first two methods for handling imbalanced data.

OverSampling


In oversampling, we increase the number of samples in minority class to match up to the number of samples of the majority class.

In simple terms, you take the minority class and try to create new samples that could match up to the length of the majority samples.

Let me explain in a much better way.

E.g., suppose we have data with 100 labels of 0 and 900 labels of 1; here the minority class is the 0s. We increase the minority class until the 9:1 ratio becomes 1:1, i.e., each minority data point is replicated 9 times in total, creating 800 new data points on top of the existing 100.

Mathematically:

1 label -------> 900 data points

0 label -------> 100 data points + 800 new points = 900 data points

Now the data ratio is 1:1:

1 label -------> 900 data points

0 label -------> 900 data points

Oversampling Implementation

We can implement it in two ways:

  1. RandomOverSampler method
  2. SMOTETomek method

First, we have to install the imblearn library; to install it, enter the command below in your terminal.

Command:  pip install imbalanced-learn

RandomOverSampler

It is the most naive method of oversampling: it randomly samples the minority class and simply duplicates the sampled observations.

RandomOverSampler implementation in python

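A minimal sketch (x and y are the features and labels built earlier; random_state is an arbitrary choice):

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
x_ros, y_ros = ros.fit_resample(x, y)
```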

Here,

  • x is the set of independent features
  • y is the dependent feature

If you want to check the samples count before and after oversampling, run the below code.

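Something like:

```python
from collections import Counter

print('Before oversampling:', Counter(y))
print('After oversampling:', Counter(y_ros))
```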

SMOTETomek

Synthetic Minority Over-sampling Technique (SMOTE) generates new observations by interpolating between existing observations of the minority class.

In simple terms, it is a technique used to generate new data points for the minority classes based on existing data. SMOTETomek additionally removes Tomek links (overlapping majority/minority pairs), which cleans up the class boundary after oversampling.

SMOTETomek implementation in python

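A minimal sketch (again, random_state is arbitrary):

```python
from imblearn.combine import SMOTETomek

smk = SMOTETomek(random_state=42)
x_smk, y_smk = smk.fit_resample(x, y)
```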

Here,

  • x is the set of independent features
  • y is the dependent feature

If you want to check the samples count before and after oversampling, run the below code.

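As before:

```python
print('Before oversampling:', Counter(y))
print('After oversampling:', Counter(y_smk))
```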

Now let's implement the same model with the oversampled data.

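A sketch, reusing the baseline classifier on the randomly oversampled data:

```python
# Split the oversampled data and retrain the same baseline model
x_train, x_test, y_train, y_test = train_test_split(
    x_ros, y_ros, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(x_train, y_train)
```

Note that resampling before the train/test split (as done here for simplicity) can leak duplicated minority samples into the test set; resampling only the training split is the safer practice.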

Let’s check the accuracy of the model.

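Again:

```python
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```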

We can see that we got very good accuracy on the balanced data, and both TP and TN increased, where:

  • TP: True Positive
  • TN: True Negative

TP and TN are components of the confusion matrix.

Oversampling pros and cons

Below are the pros and cons of using the oversampling technique.

Pros:

  • This method doesn't lead to information loss.
  • It performs well and gives good accuracy.
  • SMOTE creates new synthetic data points from the nearest neighbours of existing minority points.

Cons:

  • Increasing the size of the dataset increases training time.
  • It may also lead to overfitting, since it replicates the minority class.
  • It needs extra storage.

UnderSampling


In undersampling, we decrease the number of samples in the majority class to match the number of samples of the minority class.

In brief, you take the majority class and sample it down until it matches the size of the minority class.

Let me explain in a much better way.

E.g., suppose we have data with 100 labels of 0 and 900 labels of 1; here the minority class is the 0s. We balance the data from a 9:1 ratio to a 1:1 ratio by randomly selecting 100 data points out of the 900 in the majority class, resulting in a 1:1 ratio, i.e.,

1 label ----------------> 100 data points

0 label -----------------> 100 data points

Undersampling Implementation

We can implement it in two different ways:

  1. RandomUnderSampler method
  2. NearMiss method

Random undersampling Implementation

It simply samples the majority class at random until it reaches a similar number of observations as the minority classes.

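A minimal sketch:

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
x_rus, y_rus = rus.fit_resample(x, y)
```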

Here,

  • x is the set of independent features
  • y is the dependent feature

If you want to check the samples count before and after undersampling, run the below code.

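As before:

```python
print('Before undersampling:', Counter(y))
print('After undersampling:', Counter(y_rus))
```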

NearMiss Implementation

It selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.

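A minimal sketch (NearMiss defaults to version 1 of the heuristic):

```python
from imblearn.under_sampling import NearMiss

nm = NearMiss()
x_nm, y_nm = nm.fit_resample(x, y)
```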

Here,

  • x is the set of independent features
  • y is the dependent feature

If you want to check the samples count before and after undersampling, run the below code.

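As before:

```python
print('Before undersampling:', Counter(y))
print('After undersampling:', Counter(y_nm))
```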

Now we will implement the model using the undersampled data.

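A sketch, reusing the baseline classifier on the randomly undersampled data:

```python
x_train, x_test, y_train, y_test = train_test_split(
    x_rus, y_rus, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(x_train, y_train)
```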

Now let’s check the accuracy of the model.

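Again:

```python
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```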

Undersampling gives lower accuracy on smaller datasets because you are actually dropping information. Use this method only if you have a huge dataset.

Undersampling pros and cons

Below are the pros and cons of using the undersampling technique.

Pros:

  • Reduces storage requirements and makes training faster.
  • In most cases it creates a balanced subset that carries the greatest potential for representing the larger group as a whole.
  • It produces a simple random sample which is much less complicated than other techniques.

Cons:

  • It can ignore potentially useful information that could be important for building classifiers.
  • The sample chosen by random under-sampling may be a biased sample, resulting in inaccurate results with the actual test data.
  • Loss of useful information of the majority class.

When to use oversampling vs. undersampling

We now have a fair amount of knowledge of these two imbalance-handling techniques, but when should we use which one?

  • Oversampling: use it when the amount of data is limited.
  • Undersampling: use it when you have a huge amount of data and undersampling the majority class won't remove important information.

Complete Code

The complete code is placed below; you can also fork it from our GitHub repo.

Conclusion

When handling imbalanced datasets, there is no one right solution to improve the accuracy of the prediction model. We need to try out multiple methods to figure out the best-suited sampling techniques for the dataset.

Depending on the characteristics of the imbalanced data set, the most effective techniques will vary. In most cases, synthetic techniques like SMOTE will outperform conventional oversampling and undersampling methods.

For better results, we can use synthetic sampling methods like SMOTE and advanced boosting and ensemble algorithms.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.

4 Responses to “Best Ways To Handle Imbalanced Data In Machine Learning”

  • Hi Saimadhu, nice article, well written. I would however like to note that in most cases, unbalanced data is not a problem and balancing the data often does more harm than good. Even in your example, the unbalanced model is the best in predicting the spam cases, compared to the balanced alternatives. In my experience, building predictive models for many companies in numerous contexts, balancing the data is most often not needed and therefore not wise to do, since you either lose information (undersampling) or add noise / overfitting (oversampling / SMOTE) to the data.

    • Hi Jurraiaan Nagelkerke,

      Thanks for sharing your views. Yes, you are correct: the spam-or-ham example in the article has decent accuracy compared to the models built after handling the imbalanced data. In industry, though, we face target ratios like 98:2, and the main intention of the article is to explain how we can use the different imbalance-handling techniques.

  • Really useful information on how to handle imbalanced data. Thanks for posting 🙂
