Best Ways To Handle Imbalanced Data In Machine Learning


Did you know that a staggering 75% of machine learning projects fail to make it into production? One of the primary reasons behind this is the prevalence of imbalanced data. 

Imbalanced data is a common issue in machine learning, where the distribution of classes in the target variable is uneven. In this scenario, machine learning models tend to favor the majority class while neglecting the minority class, leading to poor performance and biased predictions. 

Building models on balanced target data is much easier than handling imbalanced data; classification algorithms also find it easier to learn from properly balanced data.

But in the real world, the data is rarely so friendly for model building. We need to handle unstructured data, and we need to handle imbalanced data.

So as a data scientist or analyst, you need to know how to deal with class imbalance.

This article delves into the challenges posed by imbalanced data, its impact on model performance, and the various techniques that can be employed to handle such situations effectively.

By addressing imbalanced data, we can significantly improve the success rate of machine learning projects and ensure that our models are robust and reliable.


What is Class Imbalance data in Machine Learning?

In machine learning, class imbalance is an issue with the target class distribution; we will explain below why it is an issue. If the target classes are not distributed equally, or at least in a roughly equal ratio, we say the data has a class imbalance issue.

Examples of balanced and imbalanced datasets

Let me give examples of balanced and imbalanced target class datasets, which will help in understanding class imbalance.

Balanced datasets:

  • Random sampling of coin tosses
  • Classifying images to cat or dog
  • Sentiment analysis of movie reviews

If you look at the above examples, for the balanced datasets the target class distribution is nearly equal.

For example, in random coin tosses, even though researchers say the probability of getting heads is slightly higher than tails, the distribution of heads and tails is still nearly equal. The same holds for the movie review case.

Imbalanced datasets:

  • Email spam or ham dataset
  • Credit card fraud detection
  • Machine component failure detection
  • Network failure detection

But when it comes to imbalanced datasets, the target distribution is not equal. For email spam or ham, the distribution is far from equal.

Just imagine how many emails we receive every day and how many of them are classified as spam. Google uses its email classifier to do that.

In general, out of every 10 emails we receive, one goes to the spam folder and the others go to the inbox, so the ham to spam ratio is about 9:1. In credit card fraud detection, the ratio is even more skewed, something like 9.5:0.5.

By now, we are clear about what imbalanced data is. Now, let's learn why we need to balance data, in other words, why we need to handle imbalanced data.

Understanding Imbalanced Data and Its Impact on Machine Learning Models

Why do we need to balance the data before building machine learning models?
The answer is quite simple: to make our predictions more accurate.

Because if we have imbalanced data, the model is biased towards the dominant target class and tends to predict every observation as the predominant class.

Say that in credit fraud detection, out of 100 credit applications only 5 fall into the fraud category. Any machine learning model will be tempted to predict the outcome against the fraud class, that is, the model predicts that the credit applicant is not a fraud.

It is reasonable that the trained model predicts the dominant class: while learning, machine learning models try to reduce the overall error, and because the minority class has so few examples, reducing its errors barely changes that total. The model therefore concentrates on making fewer errors on the majority class and largely ignores the minority class.

As we have seen, this disproportion of classes leads to bias in machine learning models, as they tend to prioritize the majority class while neglecting the minority class.

Causes of imbalanced data:

  1. Nature of the data: In certain real-world scenarios, the data is naturally imbalanced. For instance, in fraud detection, the number of fraudulent transactions is significantly lower than the number of legitimate transactions. Similarly, in medical diagnosis, the occurrence of rare diseases is much less frequent than common ailments.
  2. Data collection bias: Imbalanced data can also arise due to biases in data collection. This can happen when data is collected from sources that are not representative of the entire population, leading to an over-representation of one class and under-representation of another.
  3. Insufficient data: Another cause of imbalanced data is the lack of sufficient data for the minority class. This can happen when data collection is difficult, time-consuming, or expensive, leading to a smaller sample size for the minority class.

Effects of imbalanced data on model performance:

  1. Model bias: When training on imbalanced data, machine learning models tend to favor the majority class, as they are optimized to minimize the overall error rate. This results in a biased model that performs poorly on the minority class.
  2. Inaccurate performance metrics: Imbalanced data can also lead to misleading performance metrics. For example, in a dataset with a 95% majority class and a 5% minority class, a model that always predicts the majority class will achieve a 95% accuracy rate. However, this high accuracy rate is misleading, as the model fails to capture the minority class, which is often the class of interest (see the short sketch after this list).
  3. Overfitting: Imbalanced data can cause overfitting, where the model learns to memorize the majority class and fails to generalize to new, unseen data. This results in poor performance on real-world applications, as the model cannot adapt to variations in the data.
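To make the metrics point concrete, here is a minimal sketch (assuming scikit-learn, with an illustrative 95/5 class split) of how a model that always predicts the majority class scores high accuracy while completely missing the minority class:

```python
# A minimal sketch of how accuracy can mislead on imbalanced data.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 950 majority (0) and 50 minority (1) samples, with one dummy feature
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))

# A "model" that always predicts the majority class
majority_model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority_model.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # ~0.95, looks great
print("Minority recall:", recall_score(y, y_pred))   # 0.0, the model is useless
```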

To address these challenges and improve the performance of machine learning models, various techniques can be employed to handle imbalanced data effectively. 

By understanding the causes and effects of imbalanced data, we can make informed decisions about the appropriate strategies to mitigate its impact and ensure that our models are both robust and reliable. So to handle these kinds of issues, we need to balance the data before building the models.

How to deal with imbalanced data

To deal with imbalanced data issues, we need to convert the imbalanced data into balanced data in a meaningful way, and then build the machine learning model on the balanced dataset.

In the later sections of this article, we will learn about different techniques to handle the imbalanced data.

Before that, we build a machine learning model on imbalanced data. Later we will apply different imbalance techniques.

So let’s get started.

Model on Imbalanced Data

About Dataset

We are taking this dataset from Kaggle, and you can download it from this link.

The dataset contains a set of 5,574 SMS messages in English, tagged as either ham (legitimate) or spam.

The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam), and v2 contains the raw text.

The main task is to build a prediction model that accurately classifies which texts are spam.

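A minimal loading sketch; the file name spam.csv and the latin-1 encoding are assumptions based on the usual Kaggle download:

```python
# Load the SMS spam dataset downloaded from Kaggle.
import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")
print(df.shape)
print(df.head())
```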

Let’s have a look at the loaded data fields.


We have the target variable v1, which contains the ham or spam label, and v2, which holds the actual SMS text. In addition, we also have some unnecessary fields; we will remove them and rename the remaining columns with the code below.


We rename the remaining data fields to

  • label
  • text
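A small sketch of this cleanup step, assuming the extra columns carry the usual Unnamed: 2/3/4 names from the Kaggle CSV:

```python
# Drop the empty extra columns and rename the remaining ones.
df = df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], errors="ignore")
df = df.rename(columns={"v1": "label", "v2": "text"})
print(df.head())
```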

Data ratio

Using the seaborn countplot, let's visualize the ham and spam target ratio.

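A quick sketch of this plot and the underlying ratio:

```python
# Visualize the class ratio and print it as percentages.
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="label", data=df)
plt.title("Ham vs spam message counts")
plt.show()

print(df["label"].value_counts(normalize=True))  # roughly 87% ham, 13% spam
```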
  • Ham messages : 87%
  • Spam messages : 13%

We can clearly see that the data is imbalanced. Before creating a model, we need to do some data preprocessing.

Data Preprocessing

When we are dealing with text data, first we need to preprocess the text and then convert it into vectors.

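Here is a minimal sketch of the whole preprocessing pipeline, assuming NLTK and scikit-learn are available (the max_features value is an arbitrary choice); the bullet points below walk through each step:

```python
# Clean the SMS text, build a corpus, and convert it into count vectors.
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

corpus = []
for message in df["text"]:
    # keep letters only, lowercase, and split into words
    words = re.sub("[^a-zA-Z]", " ", message).lower().split()
    # drop stopwords and stem what is left
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # join back into a cleaned sentence and collect it in the corpus
    corpus.append(" ".join(words))

# convert the cleaned text into count vectors
cv = CountVectorizer(max_features=2500)
x = cv.fit_transform(corpus).toarray()

# convert the target label into a 0/1 categorical variable (spam = 1)
y = pd.get_dummies(df["label"], drop_first=True).astype(int).values.ravel()
```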
  • Stemming removes the suffix from a word and reduces it to its root form. We apply stemming to the text to convert each word to its root word.

  • Text usually comes mixed with a lot of special characters, numbers, etc., so we need to remove the unwanted characters. We use regular expressions to replace all the unnecessary characters with spaces.

  • Convert all the text into lowercase to avoid getting different vectors for the same word. Eg: and, And ------------> and

  • Remove stop words. "Stop words" typically refers to the most common words in a language, e.g.: he, is, at, etc. We need to filter them out.

  • Split each sentence into words.

  • Keep every word except the stop words.

  • Join the remaining words back into a sentence.

  • Append the cleaned text into a list (the corpus).

  • Now that our text is ready, convert it into vectors using CountVectorizer.

  • Convert the target label into a categorical (0/1) variable.

Model Creation

First, we simply create the model with the unbalanced data; afterwards we will try the different balancing techniques.

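A minimal sketch of this baseline; Multinomial Naive Bayes is assumed here as a common text-classification choice, not necessarily the exact classifier from the original screenshots:

```python
# Baseline model trained on the imbalanced data.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
```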

Let us check the accuracy of the model.

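A short evaluation sketch for the baseline:

```python
# Accuracy alone looks high; the confusion matrix shows where the errors sit.
from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```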

We got an accuracy of about 0.98, but this number is misleading: with such an imbalanced dataset, the model is heavily biased towards the majority (ham) class.

Now we will learn how to handle imbalanced data with different techniques in the next sections of the article.

Techniques for handling imbalanced data

There are many ways to handle imbalanced data. In this article, we will learn about the techniques below, along with their code implementation.

  1. Oversampling
  2. Undersampling
  3. Ensemble Techniques

In this article we will focus mainly on the first two methods for handling imbalanced data.

OverSampling


In oversampling, we increase the number of samples in minority class to match up to the number of samples of the majority class.

In simple terms, you take the minority class and try to create new samples that could match up to the length of the majority samples.

Let me explain in a much better way.

E.g., suppose we have data with 100 labels of 0 and 900 labels of 1; here the minority class is 0 and the ratio is 9:1. What we do is increase the minority class: for every existing minority data point we create new points until the minority class also has 900 points (800 new points in total).

Mathematically:

1 label ------> 900 data points

0 label ------> 100 data points + 800 new points = 900 data points

Now the data ratio is 1:1:

1 label ------> 900 data points

0 label ------> 900 data points

Oversampling Implementation

We can implement oversampling in two ways:

  1. RandomOverSampler method
  2. SMOTETomek method

First, we have to install the imbalanced-learn library. To install it, run the command below:

Command:  pip install imbalanced-learn

RandomOverSampler

It is the most naive method of oversampling: it randomly samples the minority class and simply duplicates the sampled observations.

RandomOverSampler implementation in Python

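A minimal sketch with imbalanced-learn (recent versions expose fit_resample; very old releases used fit_sample):

```python
# Random oversampling of the minority class.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
x_ros, y_ros = ros.fit_resample(x, y)
```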

Here,

  • x is the set of independent features
  • y is the dependent (target) feature

If you want to check the samples count before and after oversampling, run the below code.

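A small sketch of that check using collections.Counter:

```python
# Compare the class counts before and after oversampling.
from collections import Counter

print("Before:", Counter(y))
print("After :", Counter(y_ros))
```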

SMOTETomek

Synthetic Minority Over-sampling Technique (SMOTE) is a technique that generates new observations by interpolating between observations in the existing data. SMOTETomek combines SMOTE oversampling with Tomek links cleaning (an undersampling step), which we discuss later in this article.

In Simple terms, It is a technique used to generate new data points for the minority classes based on existing data. 

SMOTETomek implementation in Python

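A minimal sketch with imbalanced-learn's SMOTETomek:

```python
# Combined over- and under-sampling with SMOTETomek.
from imblearn.combine import SMOTETomek

smk = SMOTETomek(random_state=42)
x_smk, y_smk = smk.fit_resample(x, y)
```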

Here,

  • x is the set of independent features
  • y is the dependent (target) feature

As before, you can compare the sample counts before and after resampling using collections.Counter.


Now let's implement the same model with the oversampled data.
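A sketch of the retraining step, reusing the imports from the baseline block and swapping in the oversampled x and y:

```python
# Retrain the same baseline model, now on the randomly oversampled data.
x_train, x_test, y_train, y_test = train_test_split(
    x_ros, y_ros, test_size=0.2, random_state=42
)

model = MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
```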

Let's check the accuracy of the model, using the same accuracy and confusion matrix calls as before.

We can see that we got good accuracy on the balanced data, and both TP and TN increased, where

  • TP: True Positive
  • TN: True Negative

TP and TN are components of the confusion matrix.

 
 

Oversampling pros and cons

Below are the listed pros and cons of using the oversampling technique.

Pros:

  • This method doesn’t lead to information loss.
  • Performs  well and gives good accuracy.
  • It creates new synthetic data points with the nearest neighbours from existing data.

Cons:

  • Increasing the size of the dataset increases training time.
  • It may lead to overfitting, since it replicates (or closely mimics) the minority class.
  • It needs extra storage.

UnderSampling


In undersampling, we decrease the number of samples in the majority class to match the number of samples of the minority class.

In brief, you take the majority class and reduce it to a subset whose size matches the number of minority samples.

Let me explain in a much better way

E.g., suppose we have data with 100 labels of 0 and 900 labels of 1; here the minority class is 0. What we do is balance the data from a 9:1 ratio to a 1:1 ratio by randomly selecting 100 data points out of the 900 in the majority class. This results in a 1:1 ratio, i.e.,

1 label ----------------> 100 data points

0 label -----------------> 100 data points

Undersampling Implementation

We can implement undersampling in two different ways:

  1. RandomUnderSampler method
  2. NearMiss method

Random undersampling Implementation

It simply samples the majority class at random until it reaches a similar number of observations as the minority classes.

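A minimal sketch with imbalanced-learn's RandomUnderSampler:

```python
# Random undersampling of the majority class.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
x_rus, y_rus = rus.fit_resample(x, y)
```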

Here,

  • x is the set of independent features
  • y is the dependent (target) feature

As before, you can compare the sample counts before and after undersampling using collections.Counter.


NearMiss Implementation

It selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.

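A minimal sketch with imbalanced-learn's NearMiss (version 1 by default):

```python
# NearMiss undersampling of the majority class.
from imblearn.under_sampling import NearMiss

nm = NearMiss()
x_nm, y_nm = nm.fit_resample(x, y)
```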

Here,

  • x is the set of independent features
  • y is the dependent (target) feature

As before, you can compare the sample counts before and after undersampling using collections.Counter.


Now we will implement the model using the undersampled data, exactly as we did with the oversampled data; only the resampled x and y change.

Checking the accuracy of this model, we see that undersampling gives lower accuracy on smaller datasets because we are actually dropping information. Use this method only if you have a huge dataset.

 
 

Undersampling pros and cons

Below are the pros and cons of using the undersampling technique.

Pros:

  • Reduces storage problems, easy to train
  • In most cases it creates a balanced subset that carries the greatest potential for representing the larger group as a whole.
  • It produces a simple random sample which is much less complicated than other techniques.

Cons:

  • It can ignore potentially useful information which could be important for building  classifiers.
  • The sample chosen by random under-sampling may be a biased sample, resulting in inaccurate results with the actual test data.
  • Loss of useful information of the majority class.

When to use oversampling VS undersampling

We now have a fair amount of knowledge about these two imbalance handling techniques, but when should we use each of them?

  • Oversampling: we use oversampling when we have a limited amount of data.
  • Undersampling: we use undersampling when we have a huge amount of data and undersampling the majority class won't affect the data much.

Advanced Resampling Techniques

Resampling techniques are essential tools for addressing the issue of imbalanced data in machine learning. These methods modify the original dataset to create a balanced representation of the majority and minority classes, which helps improve model performance.

In this section, we will discuss three advanced resampling techniques: 

  1. Synthetic Minority Over-sampling Technique (SMOTE),
  2. Adaptive Synthetic (ADASYN),
  3. Tomek Links.

Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE is an oversampling technique that generates synthetic samples for the minority class. The algorithm works by selecting a random instance from the minority class and finding its k-nearest neighbors.

A synthetic instance is then created by interpolating between the selected instance and one of its neighbors. This process is repeated until the desired number of synthetic samples is generated, leading to a balanced dataset. SMOTE helps improve the model's ability to generalize and reduces the risk of overfitting.
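A minimal SMOTE sketch with imbalanced-learn, where k_neighbors controls how many nearest minority neighbours are used for the interpolation:

```python
# SMOTE: interpolate new minority samples between existing neighbours.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
x_smote, y_smote = smote.fit_resample(x, y)
```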

Adaptive Synthetic (ADASYN)

ADASYN is an extension of the SMOTE technique, designed to adaptively generate synthetic samples based on the local density of the minority class instances.

The main idea behind ADASYN is to generate more synthetic samples for instances that are harder to classify, thus emphasizing the learning of the model in more challenging regions. This method calculates the number of synthetic instances to be generated for each minority instance based on the ratio of its nearest neighbors that belong to the majority class.

By focusing on areas where the model struggles, ADASYN helps enhance the model's performance on the minority class.
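A minimal ADASYN sketch with imbalanced-learn; n_neighbors defines the neighbourhood used to estimate how hard each minority instance is, and harder instances receive more synthetic samples:

```python
# ADASYN: density-adaptive synthetic oversampling of the minority class.
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(n_neighbors=5, random_state=42)
x_ada, y_ada = adasyn.fit_resample(x, y)
```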

Tomek Links

Tomek Links is an undersampling technique that aims to remove instances from the majority class that are close to the minority class instances. A pair of instances from different classes is considered a Tomek Link if there is no instance of either class closer to each other than these two instances.

By removing the majority class instances that form Tomek Links, the algorithm reduces the overlap between the classes and improves the decision boundary, leading to better classification performance. 
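A minimal Tomek Links cleaning sketch with imbalanced-learn, removing only the majority-class member of each detected link:

```python
# Tomek Links: remove majority-class points that overlap the minority class.
from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy="majority")
x_tl, y_tl = tl.fit_resample(x, y)
```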

By employing these methods, we can create balanced datasets that enable machine learning models to perform better on both majority and minority classes, resulting in more accurate and reliable predictions.

Cost-Sensitive Learning for Imbalanced Data

Cost-sensitive learning is an alternative approach to addressing imbalanced data in machine learning models. Instead of modifying the dataset itself, cost-sensitive learning focuses on adjusting the learning process to account for the different costs associated with misclassification errors. 

Incorporating Misclassification Costs

In cost-sensitive learning, we assign different costs to misclassifying instances from different classes. This way, the learning algorithm becomes aware of the imbalance and can prioritize minimizing the cost of misclassification errors, rather than simply maximizing the accuracy.

By assigning a higher cost to misclassifying minority class instances, the model is encouraged to pay more attention to them during the learning process.

Cost-Sensitive Algorithms

Several machine learning algorithms can be adapted to incorporate cost-sensitive learning. Some popular cost-sensitive algorithms include the following (a short scikit-learn sketch follows the list):

  • Cost-Sensitive Decision Trees: Decision tree algorithms can be modified to use misclassification costs when choosing the best split at each node. This encourages the tree to prioritize splits that minimize the overall cost of misclassification.
  • Cost-Sensitive Support Vector Machines (SVM): In SVM, the cost parameter C can be adjusted separately for each class, making the model more sensitive to instances from the minority class.
  • Cost-Sensitive Logistic Regression: Logistic regression can be modified by incorporating class weights into the objective function, which penalizes the model more for misclassifying instances from the minority class.
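A short scikit-learn sketch of these ideas; the weight values below are illustrative, not tuned:

```python
# Cost-sensitive learning via class weights in scikit-learn.
# class_weight="balanced" weights classes inversely to their frequency;
# an explicit dict lets you encode your own misclassification costs.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
svm = SVC(class_weight={0: 1, 1: 10})   # errors on class 1 cost 10x more
tree = DecisionTreeClassifier(class_weight="balanced")

# each model is then trained as usual, e.g. log_reg.fit(x_train, y_train)
```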

Advantages and Limitations of Cost-Sensitive Learning

Advantages:

  • Cost-sensitive learning directly addresses the issue of imbalanced data by prioritizing the minority class during the learning process.
  • Unlike resampling techniques, cost-sensitive learning does not modify the dataset, thus preserving the original distribution of the data.
  • It is applicable to various machine learning algorithms and can be easily incorporated into existing models.

Limitations:

  • Determining appropriate misclassification costs can be challenging and may require domain knowledge or trial-and-error experimentation.
  • Cost-sensitive learning may not be suitable for all types of data or problems, particularly when the minority class instances are too few or too noisy.

By incorporating misclassification costs and utilizing cost-sensitive algorithms, machine learning models can achieve better performance on imbalanced datasets and produce more reliable predictions. 

However, it is essential to carefully consider the advantages and limitations of cost-sensitive learning and choose the best approach based on the specific problem and dataset at hand.

Ensemble Methods for Imbalanced Data

Ensemble methods, which combine multiple base learners to produce a more accurate and robust model, can be adapted to handle imbalanced data effectively. In this section, we will discuss adapting bagging and boosting for imbalanced data, as well as the Balanced Random Forest algorithm.

Adapting Bagging for Imbalanced Data

Bagging, or bootstrap aggregating, involves training multiple base learners on different random subsets of the data and combining their predictions. To adapt bagging for imbalanced data, two common techniques are used, underbagging and overbagging (see the sketch after this list):

  • Underbagging: In underbagging, random subsets of the majority class are selected and combined with the complete minority class to form new balanced datasets. Base learners are then trained on these balanced datasets and combined using majority voting or averaging.
  • Overbagging: Overbagging involves oversampling the minority class to balance the class distribution in the training subsets. This can be done by duplicating minority class instances or using synthetic samples generated by techniques like SMOTE.
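One way to sketch underbagging is with imbalanced-learn's BalancedBaggingClassifier, which balances each bootstrap sample by randomly undersampling the majority class; x_train and y_train here are whatever training split you are using:

```python
# Underbagging: each bootstrap is rebalanced before a base learner
# (a decision tree by default) is trained on it.
from imblearn.ensemble import BalancedBaggingClassifier

bbc = BalancedBaggingClassifier(n_estimators=10, random_state=42)
bbc.fit(x_train, y_train)
```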

Adapting Boosting for Imbalanced Data

Boosting is an ensemble method that trains a series of base learners sequentially, where each learner tries to correct the errors made by its predecessor.

To adapt boosting for imbalanced data, adjustments can be made to the sample weights or the loss function to account for the class imbalance.
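A minimal sketch of this idea using imbalanced-learn's RUSBoostClassifier, which randomly undersamples the majority class before fitting each boosting iteration:

```python
# Boosting adapted for imbalance via random undersampling per iteration.
from imblearn.ensemble import RUSBoostClassifier

rusboost = RUSBoostClassifier(n_estimators=50, random_state=42)
rusboost.fit(x_train, y_train)
```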

Balanced Random Forest Algorithm

The Balanced Random Forest algorithm is an extension of the Random Forest algorithm that handles imbalanced data by incorporating both underbagging and random feature selection. 

This method creates balanced bootstrapped samples from the original data, and each tree in the ensemble is trained on one of these balanced samples.
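A minimal sketch using imbalanced-learn's BalancedRandomForestClassifier:

```python
# Balanced Random Forest: every tree is grown on a bootstrap sample that is
# balanced by undersampling the majority class, with random feature selection.
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(x_train, y_train)
```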

Deep Learning Approaches for Imbalanced Data

Deep learning models, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), can also be adapted to handle imbalanced data effectively. In this section, we will discuss adapting these models and incorporating class weighting.

Convolutional Neural Networks (CNN) for Imbalanced Data

CNNs can be adapted for imbalanced data by applying data augmentation techniques to the minority class, which helps increase the number of training samples and balance the class distribution.

Additionally, class weighting can be applied to the loss function, which penalizes the model more for misclassifying instances from the minority class.

Recurrent Neural Networks (RNN) for Imbalanced Data

Similar to CNNs, RNNs can also benefit from data augmentation and class weighting to handle imbalanced data. Furthermore, RNNs can be combined with other techniques such as oversampling and ensemble methods to improve their performance on imbalanced datasets.

Class Weighting in Deep Learning Models

Class weighting can be applied to the loss function of deep learning models to account for imbalanced data. By assigning higher weights to the minority class instances, the model is encouraged to focus more on these instances during training, resulting in better performance on imbalanced datasets.
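A minimal sketch of class weighting with Keras; the TensorFlow dependency, the tiny network, and the use of sklearn's compute_class_weight are all assumptions made for illustration, and x_train must be a dense numeric array here:

```python
# Class weighting in a Keras model: the minority class gets a larger weight.
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(x_train.shape[1],)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, class_weight=class_weight)
```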

By using these strategies, machine learning models can achieve better performance on imbalanced datasets and produce more reliable predictions.

Complete Code For Handling Imbalanced Data

The complete code is placed below; you can also fork the code from our GitHub repo.

Conclusion

When handling imbalanced datasets, there is no one right solution to improve the accuracy of the prediction model. We need to try out multiple methods to figure out the best-suited sampling techniques for the dataset.

Depending on the characteristics of the imbalanced data set, the most effective techniques will vary. In most cases, synthetic techniques like SMOTE will outperform conventional oversampling and undersampling methods.

For better results, we can use synthetic sampling methods like SMOTE and advanced boosting and ensemble algorithms.


I hope you like this post. If you have any questions or want me to write an article on a specific topic, feel free to comment below.

