Best Ways To Handle Imbalanced Data In Machine Learning

Handling imbalanced data with Python
When dealing with a classification problem, we do not always get the target classes in an equal ratio. There will be situations where the data you get is very imbalanced, i.e., the classes are not equal. In the machine learning world, we call this the class imbalance issue.
Building models on balanced target data is much easier than handling imbalanced data; classification algorithms also find it easier to learn from properly balanced data.
But in the real world, the data is not always ready for building models easily. We often need to handle unstructured data, and we need to handle imbalanced data.
So as a data scientist or analyst, you need to know how to deal with class imbalance.
In this article, we are going to give insights into how to deal with this situation. There are various techniques used to handle imbalanced data. Let's learn about them in detail, along with their implementation in Python.
Before we go further, let's look at the topics you will learn by the end of this article:
- What class imbalance in machine learning is
- Why we have to balance the data
- Building a model on the imbalanced data
- Oversampling and undersampling techniques, with Python implementations
- When to use oversampling vs. undersampling
What is class imbalance in machine learning?
In machine learning, class imbalance is an issue with the target class distribution; we will explain below why it is an issue. If the target classes are not equally distributed, i.e., not in a roughly equal ratio, we say the data has a class imbalance issue.
Examples of balanced and imbalanced datasets
Let me give examples of balanced and imbalanced target class datasets, which will help in understanding class imbalance.
Balanced datasets:-
- A random sample of coin flip trials
- Classifying images to cat or dog
- Sentiment analysis of movie reviews
As you can see in the above examples, for balanced datasets the target class distribution is nearly equal.
For example, in the coin flip trials, even though some researchers say the probability of getting heads is slightly higher than tails, the distribution of heads and tails is still nearly equal. It is the same with the movie review case too.
Imbalanced datasets:-
- Email spam or ham dataset
- Credit card fraud detection
- Machine components failure detections
- Network failure detections
But when it comes to imbalanced datasets, the target distribution is not equal. For email spam or ham, the distribution is not equal.
Just imagine how many emails we receive every day and how many are classified as spam; Google uses its email classifier to do that.
In general, out of every 10 emails we receive, one goes to the spam folder and the other nine go to the inbox, so the ham-to-spam ratio is about 9:1. In credit card fraud detection, the ratio is far more skewed, closer to 99.5:0.5.
By now, we are clear about what imbalanced data is. Now let's learn why we need to balance the data; in other words, why we need to handle imbalanced data.
Why do we have to balance the data?
The answer is quite simple: to make our predictions more accurate.
If we have imbalanced data, the model becomes biased toward the dominant target class and tends to predict the predominant class for every input.
Let's say in credit fraud detection, out of 100 credit applications only 5 fall into the fraud category. Any machine learning model will be tempted to predict the outcome against the fraud class, i.e., to predict that the credit applicant is not a fraud.
The trained model favoring the dominant class is reasonable: while learning, machine learning models try to reduce the overall error, and since the minority class contributes very few samples, the model gains little from reducing errors on it and focuses on getting the majority class predictions right.
So to handle these kinds of issues, we need to balance the data before building the models.
How to deal with imbalanced data
To deal with the imbalanced data issue, we need to convert the imbalanced data into balanced data in a meaningful way, and then build the machine learning model on the balanced dataset.
In the later sections of this article, we will learn about the different techniques to handle imbalanced data.
Before that, we will build a machine learning model on the imbalanced data. Later we will apply the different imbalance-handling techniques.
So let’s get started.
Model on imbalanced data
About Dataset
We are taking this dataset from Kaggle, and you can download it from this link.
The dataset contains a set of 5,574 SMS messages in English, tagged as ham (legitimate) or spam.
The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam), and v2 contains the raw text.
The main task is to build a prediction model that will accurately classify which texts are spam.
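A minimal sketch of loading the data with pandas, assuming the downloaded Kaggle file is named spam.csv (the file ships in latin-1 encoding, so we pass that explicitly):

```python
import pandas as pd

# Load the SMS spam dataset; the Kaggle file is latin-1 encoded,
# so reading it as plain UTF-8 would fail
df = pd.read_csv("spam.csv", encoding="latin-1")
```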
Let’s have a look at the loaded data fields.
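A quick look at the first few rows, continuing with the df loaded above:

```python
# Inspect the first few rows and the column names
print(df.head())
print(df.columns)
```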
We have the target variable v1, which contains the ham or spam label, and v2, which holds the actual SMS text. In addition, we have some unnecessary fields; we will remove them with the code below.
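A sketch of dropping the extra columns; the Kaggle file typically carries three mostly empty columns named Unnamed: 2, Unnamed: 3, and Unnamed: 4, but check df.columns on your copy first:

```python
# Drop the mostly empty extra columns that come with the Kaggle file
df = df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
```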
We renamed the loaded data fields to
- label
- text
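A one-line sketch of the renaming step:

```python
# Rename v1/v2 to more descriptive column names
df = df.rename(columns={"v1": "label", "v2": "text"})
```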
Data ratio
Using the seaborn countplot, let's visualize the ham and spam target ratio.
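A sketch of the countplot, assuming the renamed df from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize how many ham vs. spam messages the dataset contains
sns.countplot(x="label", data=df)
plt.title("Ham vs. spam message counts")
plt.show()
```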
- Ham messages: 87%
- Spam messages: 13%
We can clearly see how imbalanced the data is. Before creating a model, we need to do some data preprocessing.
Data Preprocessing
When we are dealing with text data, first we need to preprocess the text and then convert it into vectors. The steps are listed below, followed by a code sketch.
- Stemming removes the suffix from a word and reduces it to its root word. First, apply stemming to the text so each word is converted to its root form.
- Text usually comes mixed with a lot of special characters, numbers, etc., and we need to remove this unwanted content. Use regular expressions to replace all the unnecessary characters with spaces.
- Convert all the text into lowercase to avoid getting different vectors for the same word. E.g.: and, And ------------> and
- Remove stop words. "Stop words" are the most common words in a language, e.g., he, is, at, etc., and we need to filter them out.
- Split each sentence into words.
- Keep only the words that are not stop words.
- Join the remaining words back into a sentence.
- Append the cleaned text to a list (the corpus).
- Now that the text is ready, convert it into vectors using CountVectorizer.
- Convert the target label into a numerical form.
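A sketch of the whole cleaning pipeline described above, using NLTK's PorterStemmer and English stopword list; max_features=2500 is an arbitrary vocabulary cap chosen here for illustration:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

corpus = []
for message in df["text"]:
    # Replace everything except letters with spaces
    cleaned = re.sub("[^a-zA-Z]", " ", message)
    # Lowercase and split the sentence into words
    words = cleaned.lower().split()
    # Drop stopwords and stem the remaining words to their root form
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Join the words back into a sentence and collect it
    corpus.append(" ".join(words))

# Convert the cleaned text into count vectors
vectorizer = CountVectorizer(max_features=2500)
x = vectorizer.fit_transform(corpus).toarray()

# Convert the target labels into numbers (ham -> 0, spam -> 1)
y = (df["label"] == "spam").astype(int).values
```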
Model Creation
First, we simply create the model with the unbalanced data; afterward we will try the different balancing techniques.
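A minimal sketch of the baseline model; the classifier choice is an assumption here, with Multinomial Naive Bayes picked as a common baseline for count-vectorized text:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hold out 20% of the data for evaluation
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Train the classifier on the imbalanced data
model = MultinomialNB()
model.fit(x_train, y_train)
```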
Let us check the accuracy of the model.
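A sketch of the evaluation, wrapped in a small helper so we can reuse it on the balanced models later:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(model, x_test, y_test):
    """Print the accuracy and the confusion matrix for a fitted model."""
    y_pred = model.predict(x_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

evaluate(model, x_test, y_test)
```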
We got an accuracy of 0.98, but this number is misleading: with roughly 87% ham messages, a model biased toward the majority class can reach a high accuracy while still performing poorly on the minority (spam) class.
Now we will learn how to handle imbalanced data with the different techniques in the next sections of the article.
Techniques for handling imbalanced data
There are many ways to handle imbalanced data. In this article, we will learn about the below techniques, along with their code implementations.
- Oversampling
- Undersampling
- Ensemble Techniques
In this article, we will focus only on the first two methods for handling imbalanced data.
OverSampling
In oversampling, we increase the number of samples in the minority class to match the number of samples in the majority class.
In simple terms, you take the minority class and create new samples until it matches the length of the majority class.
Let me explain in a much better way.
E.g., suppose we have data with 100 samples labeled 0 and 900 samples labeled 1, so the minority class is 0 and the ratio is 9:1. To balance the data, every minority data point is replicated about 9 times, i.e., 8 new data points are created on top of each original one, adding 800 points in total.
Mathematically:
1 label --------------> 900 data points
0 label --------------> 100 data points + 800 new data points = 900 data points
Now the data ratio is 1:1:
1 label --------------> 900 data points
0 label --------------> 900 data points
Oversampling Implementation
We can implement this in two ways:
- RandomOverSampler method
- SMOTETomek method
First, we have to install the imblearn library; to install it, enter the below command in cmd.
Command: pip install imbalanced-learn
RandomOverSampler
It is the most naive method of oversampling: it randomly samples the minority class and simply duplicates the sampled observations.
RandomOverSampler implementation in Python
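A sketch of random oversampling with imblearn (random_state=42 is an arbitrary seed):

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class samples until the classes are balanced
oversampler = RandomOverSampler(random_state=42)
x_ros, y_ros = oversampler.fit_resample(x, y)
```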
Here,
- x is the set of independent features
- y is the dependent feature
If you want to check the samples count before and after oversampling, run the below code.
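One way to check, using collections.Counter:

```python
from collections import Counter

print("Before oversampling:", Counter(y))
print("After oversampling:", Counter(y_ros))
```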
SMOTETomek
Synthetic Minority Over-sampling Technique (SMOTE) is a technique that generates new observations by interpolating between observations in the existing data.
In simple terms, it generates new data points for the minority classes based on the existing data. SMOTETomek combines SMOTE oversampling with Tomek links undersampling, which removes ambiguous pairs of points near the class boundary.
SMOTETomek implementation in Python
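A sketch using imblearn's combined sampler (again with an arbitrary seed):

```python
from imblearn.combine import SMOTETomek

# SMOTE synthesizes new minority points; Tomek links then remove
# ambiguous pairs of points near the class boundary
smote_tomek = SMOTETomek(random_state=42)
x_st, y_st = smote_tomek.fit_resample(x, y)
```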
Here,
- x is the set of independent features
- y is the dependent feature
If you want to check the samples count before and after oversampling, run the below code.
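Again with collections.Counter:

```python
from collections import Counter

print("Before resampling:", Counter(y))
print("After resampling:", Counter(y_st))
```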
Now let's implement the same model with the oversampled data.
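A sketch of retraining the same baseline on the balanced data (the SMOTETomek output x_st, y_st is used here; the RandomOverSampler output x_ros, y_ros would work the same way):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Re-split and retrain on the oversampled data
x_train_bal, x_test_bal, y_train_bal, y_test_bal = train_test_split(
    x_st, y_st, test_size=0.2, random_state=0
)
model_bal = MultinomialNB()
model_bal.fit(x_train_bal, y_train_bal)
```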
Let’s check the accuracy of the model.
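Reusing the evaluate helper defined earlier:

```python
# Accuracy and confusion matrix on the oversampled test split
evaluate(model_bal, x_test_bal, y_test_bal)
```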
We can see that we got very good accuracy on the balanced data, and both TP and TN increased, where:
- TP: True Positive
- TN: True Negative
TP and TN are components of the confusion matrix.
Oversampling pros and cons
Below are the pros and cons of using the oversampling technique.
- Pros: no information from the original dataset is lost, and it works well even when you have only a small amount of data.
- Cons: duplicating or synthesizing minority samples increases the chance of overfitting, and the larger dataset increases training time.
UnderSampling
In undersampling, we decrease the number of samples in the majority class to match the number of samples in the minority class.
In brief, you take the majority class and keep only as many samples as the minority class has.
Let me explain in a much better way
E.g., suppose we have data with 100 samples labeled 0 and 900 samples labeled 1, so the minority class is 0. To balance the data from a 9:1 ratio to a 1:1 ratio, we randomly select 100 data points out of the 900 in the majority class. This results in a 1:1 ratio, i.e.,
1 label ----------------> 100 data points
0 label ----------------> 100 data points
Undersampling Implementation
We can implement this in two different ways:
- RandomUnderSampler method
- NearMiss method
RandomUnderSampler implementation in Python
It simply samples the majority class at random until it reaches a number of observations similar to that of the minority class.
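A sketch of random undersampling with imblearn:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until the classes are balanced
undersampler = RandomUnderSampler(random_state=42)
x_rus, y_rus = undersampler.fit_resample(x, y)
```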
Here,
- x is the set of independent features
- y is the dependent feature
If you want to check the samples count before and after undersampling, run the below code.
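Checking the counts as before:

```python
from collections import Counter

print("Before undersampling:", Counter(y))
print("After undersampling:", Counter(y_rus))
```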
NearMiss Implementation
It selects samples from the majority class for which the average distance to the N closest samples of the minority class is the smallest.
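A sketch with imblearn's NearMiss (the default is version 1 of the heuristic):

```python
from imblearn.under_sampling import NearMiss

# Keep the majority samples whose average distance to the closest
# minority samples is smallest
nearmiss = NearMiss()
x_nm, y_nm = nearmiss.fit_resample(x, y)
```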
Here,
- x is the set of independent features
- y is the dependent feature
If you want to check the samples count before and after undersampling, run the below code.
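Checking the counts once more:

```python
from collections import Counter

print("Before undersampling:", Counter(y))
print("After undersampling:", Counter(y_nm))
```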
Now we will implement the model using the undersampled data.
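A sketch of retraining on the undersampled data (the NearMiss output x_nm, y_nm is used here; x_rus, y_rus from RandomUnderSampler would work the same way):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Re-split and retrain on the undersampled data
x_train_us, x_test_us, y_train_us, y_test_us = train_test_split(
    x_nm, y_nm, test_size=0.2, random_state=0
)
model_us = MultinomialNB()
model_us.fit(x_train_us, y_train_us)
```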
Now let’s check the accuracy of the model.
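Again reusing the evaluate helper from earlier:

```python
# Accuracy and confusion matrix on the undersampled test split
evaluate(model_us, x_test_us, y_test_us)
```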
Undersampling tends to give lower accuracy on smaller datasets because you are actually dropping information. Use this method only if you have a huge dataset.
Undersampling pros and cons
Below are the pros and cons of using the undersampling technique.
- Pros: the training data gets smaller, so training is faster and needs less memory.
- Cons: dropping samples from the majority class can discard useful information, which can hurt model performance.
When to use oversampling VS undersampling
We now have a fair amount of knowledge of these two techniques for handling imbalanced data; both methods address the same issue, so the choice between them mainly depends on how much data we have.
- Oversampling: we use oversampling when we have a limited amount of data.
- Undersampling: we use undersampling when we have huge data and dropping part of the majority class won't affect the data.
Complete Code
The complete code is available in our Github repo, which you can fork.
Conclusion
When handling imbalanced datasets, there is no one right solution to improve the accuracy of the prediction model. We need to try out multiple methods to figure out the best-suited sampling techniques for the dataset.
Depending on the characteristics of the imbalanced dataset, the most effective technique will vary. In many cases, synthetic techniques like SMOTE will outperform conventional oversampling and undersampling methods.
For better results, we can use synthetic sampling methods like SMOTE and advanced boosting and ensemble algorithms.
Comments
Hi Saimadhu, nice article, well written. I would however like to note that in most cases, unbalanced data is not a problem and balancing the data often does more harm than good. Even in your example, the unbalanced model is the best in predicting the spam cases, compared to the balanced alternatives. In my experience, building predictive models for many companies in numerous contexts, balancing the data is most often not needed and therefore not wise to do, since you either lose information (undersampling) or add noise / overfitting (oversampling / SMOTE) to the data.
Hi Jurraiaan Nagelkerke,
Thanks for sharing your views. Yes, you are correct: the spam-or-ham model shown in the article has a decent amount of accuracy compared to the models built after handling the imbalance. Still, in industry we face target data issues with ratios like 98:2, and the main intention of the article is to explain how we can use the different imbalanced data techniques in such cases.
Really useful information on how to handle imbalanced data. Thanks for posting 🙂
Hi,
Thanks for the compliment.
We wish you a very happy learning.