Best Ways To Handle Imbalanced Data In Machine Learning
Did you know that a staggering 75% of machine learning projects fail to make it into production? One of the primary reasons behind this is the prevalence of imbalanced data.
Imbalanced data is a common issue in machine learning, where the distribution of classes in the target variable is uneven. In this scenario, machine learning models tend to favor the majority class while neglecting the minority class, leading to poor performance and biased predictions.
Building models on balanced target data is easier than handling imbalanced data; classification algorithms also find it easier to learn from properly balanced classes.
But real-world data is rarely that convenient. We often have to deal with unstructured data, and we also have to deal with imbalanced data.
So as a data scientist or analyst, you need to know how to deal with class imbalance.
This article delves into the challenges posed by imbalanced data, its impact on model performance, and the various techniques that can be employed to handle such situations effectively.
By addressing imbalanced data, we can significantly improve the success rate of machine learning projects and ensure that our models are robust and reliable.
Before we go further, let's start with the fundamentals of class imbalance.
What is Class Imbalance data in Machine Learning?
In machine learning, class imbalance is an issue with the distribution of the target classes: if the classes are not represented in roughly equal proportions, we say the dataset has a class imbalance problem. Let's look at why this is an issue.
Examples of balanced and imbalanced datasets
Let me give examples of balanced and imbalanced target classes, which will help in understanding class imbalance datasets.
Balanced datasets:
- A random sampling of coin tosses
- Classifying images to cat or dog
- Sentiment analysis of movie reviews
If you look at the above examples, for the balanced datasets the target class distribution is nearly equal.
For example, in random coin tosses, even though some researchers suggest the probability of getting heads is slightly higher than tails, the distribution of heads and tails is still nearly equal. The same holds for the movie review case.
Imbalanced datasets:
- Email spam or ham dataset
- Credit card fraud detection
- Machine component failure detection
- Network failure detection
But when it comes to an imbalanced dataset, the target distribution is not equal. For the email spam-or-ham problem, the distribution is far from equal.
Just imagine how many emails we receive every day and how few of them are classified as spam. Google uses its own email classifier to do this.
In general, out of every 10 emails we receive, roughly one goes to the spam folder and the rest go to the inbox, so the ham-to-spam ratio is about 9:1. In credit card fraud detection the imbalance is even more extreme, with fraudulent transactions making up only a tiny fraction of all transactions.
By now, we are clear about what imbalanced data is. Now let's learn why we need to balance the data, in other words, why we need to handle imbalanced data.
Understanding Imbalanced Data and Its Impact on Machine Learning Models
Why do we need to balance the data before building machine learning models?
The answer is quite simple, to make our predictions more accurate.
If we train on imbalanced data, the model becomes biased toward the dominant target class and tends to predict that class for almost every input.
Say in credit fraud detection, out of 100 credit applications only 5 fall into the fraud category. Any machine learning model will therefore be tempted to predict against the fraud class, that is, to predict that the applicant is not a fraud.
This behavior is understandable: while learning, the model tries to minimize the overall error, and because the minority class contributes so few examples, the cheapest way to do that is to get the majority class right. The model therefore hardly considers reducing the errors it makes on the minority class.
Imbalanced data is a common issue in the realm of machine learning, where the distribution of classes in the target variable is uneven. This disproportion of classes can lead to a bias in machine learning models, as they tend to prioritize the majority class while neglecting the minority class.
Causes of imbalanced data:
- Nature of the data: In certain real-world scenarios, the data is naturally imbalanced. For instance, in fraud detection, the number of fraudulent transactions is significantly lower than the number of legitimate transactions. Similarly, in medical diagnosis, the occurrence of rare diseases is much less frequent than common ailments.
- Data collection bias: Imbalanced data can also arise due to biases in data collection. This can happen when data is collected from sources that are not representative of the entire population, leading to an over-representation of one class and under-representation of another.
- Insufficient data: Another cause of imbalanced data is the lack of sufficient data for the minority class. This can happen when data collection is difficult, time-consuming, or expensive, leading to a smaller sample size for the minority class.
Effects of imbalanced data on model performance:
- Model bias: When training on imbalanced data, machine learning models tend to favor the majority class, as they are optimized to minimize the overall error rate. This results in a biased model that performs poorly on the minority class.
- Inaccurate performance metrics: Imbalanced data can also lead to misleading performance metrics. For example, in a dataset with a 95% majority class and a 5% minority class, a model that always predicts the majority class will achieve a 95% accuracy rate. However, this high accuracy rate is misleading, as the model fails to capture the minority class, which is often the class of interest.
- Overfitting: Imbalanced data can cause overfitting, where the model learns to memorize the majority class and fails to generalize to new, unseen data. This results in poor performance on real-world applications, as the model cannot adapt to variations in the data.
To address these challenges and improve the performance of machine learning models, various techniques can be employed to handle imbalanced data effectively.
By understanding the causes and effects of imbalanced data, we can make informed decisions about the appropriate strategies to mitigate its impact and ensure that our models are both robust and reliable. So to handle these kinds of issues, we need to balance the data before building the models.
How to deal with imbalanced data
To deal with imbalanced data, we need to convert the imbalanced dataset into a balanced one in a meaningful way, and then build the machine learning model on the balanced dataset.
In the later sections of this article, we will learn about different techniques to handle the imbalanced data.
Before that, we build a machine learning model on imbalanced data. Later we will apply different imbalance techniques.
So let’s get started.
Model on Imbalanced Data
About Dataset
We are taking this dataset from Kaggle, where you can download it.
The dataset contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
The file contains one message per line, with two columns: v1 holds the label (ham or spam) and v2 contains the raw text.
The main task is to build a prediction model that accurately classifies which texts are spam.
Let’s have a look at the loaded data fields.
We have the target variable v1, which contains the ham or spam label, and v2, which contains the actual SMS text. In addition, there are a few unnecessary columns, which we remove with the code below.
We renamed the loaded data fields to
- label
- text
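For reference, here is a minimal sketch of this loading and cleanup step; the file name spam.csv and the latin-1 encoding are assumptions based on the Kaggle version of the dataset.

```python
import pandas as pd

# File name and encoding are assumptions for the Kaggle SMS spam dataset.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only the label (v1) and message text (v2) columns,
# then rename them to the names used in the rest of this article.
df = df[["v1", "v2"]]
df.columns = ["label", "text"]
print(df.head())
```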
Data ratio
Using the seaborn countplot, let's visualize the ham and spam target ratio.
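A quick sketch of that visualization, assuming the df data frame loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count how many messages fall into each class.
sns.countplot(x="label", data=df)
plt.title("Ham vs spam message counts")
plt.show()
```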
- Ham messages : 87%
- Spam messages : 13%
We can clearly see that the data is imbalanced. Before creating a model, we need to do some data preprocessing.
Data Preprocessing
When we are dealing with text data, first we need to preprocess the text and then convert it into vectors.
Stemming removes the suffix from a word and reduces it to its root form. We first apply stemming so that different inflections of a word map to the same root.
Raw text usually comes mixed with special characters, numbers, and so on, so we need to strip this unwanted content. We use regular expressions to replace everything that is not a letter with spaces.
We then convert all the text to lowercase so that the same word does not produce different vectors (e.g., "and" and "And" both become "and").
Next, we remove stop words. "Stop words" are the most common words in a language, such as "he", "is", "at"; they add little signal, so we filter them out.
The cleaning loop therefore does the following for each message:
- Split the sentence into words
- Keep every word except the stop words
- Stem each remaining word and join the words back into a sentence
- Append the cleaned text to a list (the corpus)
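Putting the steps above together, here is a sketch of the cleaning loop. It assumes NLTK's PorterStemmer and English stop word list; the exact preprocessing code of the original article may differ.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

corpus = []
for message in df["text"]:
    # Keep letters only and lowercase everything.
    cleaned = re.sub("[^a-zA-Z]", " ", message).lower()
    # Split into words, drop stop words, and stem the remaining words.
    words = [stemmer.stem(word) for word in cleaned.split() if word not in stop_words]
    # Join the words back into a sentence and collect it in the corpus.
    corpus.append(" ".join(words))
```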
Now our text is ready; we convert it into vectors using CountVectorizer and encode the target label as a numeric (categorical) value.
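A sketch of the vectorization and label encoding; max_features=2500 is an illustrative choice, not a value taken from the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn the cleaned sentences into a bag-of-words matrix.
vectorizer = CountVectorizer(max_features=2500)
x = vectorizer.fit_transform(corpus).toarray()

# Encode the target: ham -> 0, spam -> 1.
y = df["label"].map({"ham": 0, "spam": 1}).values
```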
Model Creation
First, we simply create the model with the unbalanced data; afterwards, we will try different balancing techniques.
Let us check the accuracy of the model.
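Here is a minimal sketch of this step. The choice of Multinomial Naive Bayes is an assumption (it is a common baseline for text classification); x and y are the arrays built during preprocessing.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

def train_and_evaluate(features, labels):
    """Split the data, fit a Naive Bayes classifier, and report its accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    model = MultinomialNB()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print(confusion_matrix(y_test, predictions))
    return model

# Train on the original, imbalanced data.
model_imbalanced = train_and_evaluate(x, y)
```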
We got an accuracy of 0.98, but this high accuracy is misleading because the model is biased toward the majority (ham) class.
In the next section of the article, we will learn how to handle imbalanced data with different balancing techniques.
Techniques for handling imbalanced data
There are many ways of handling imbalanced data. In this article, we will learn about the techniques below, along with their code implementation.
- Oversampling
- Undersampling
- Ensemble Techniques
In this article, the code implementation focuses only on the first two methods for handling imbalanced data.
OverSampling
In oversampling, we increase the number of samples in the minority class to match the number of samples in the majority class.
In simple terms, you take the minority class and try to create new samples that could match up to the length of the majority samples.
Let me explain in a much better way.
For example, suppose we have a dataset with 100 samples labeled 0 and 900 samples labeled 1, so the minority class is 0 and the ratio is 9:1. To balance it, we replicate (or synthesize) minority samples until each of the 100 points is represented roughly 9 times, which adds about 800 new points on top of the existing 100.
Mathematically:
- Label 1: 900 data points
- Label 0: 100 data points + 800 new points = 900 data points

Now the data ratio is 1:1:
- Label 1: 900 data points
- Label 0: 900 data points
Oversampling Implementation
We can implement it in two ways:
- RandomOverSampler method
- SMOTETomek method
First, we have to install the imbalanced-learn library. To install it, run the command below in your terminal.
Command: pip install imbalanced-learn
RandomOverSampler
It is the most naive method of oversampling: it randomly samples the minority class and simply duplicates the sampled observations.
RandomOverSampler implementation in Python
Here,
- x is the set of independent features
- y is the dependent feature (target)
If you want to check the samples count before and after oversampling, run the below code.
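A sketch of both steps, resampling and comparing the class counts, assuming the x and y arrays from the preprocessing section:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
x_over, y_over = ros.fit_resample(x, y)

# Class counts before and after oversampling.
print("Before:", Counter(y))
print("After :", Counter(y_over))
```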
SMOTETomek
Synthetic Minority Over-sampling Technique (SMOTE) generates new observations by interpolating between existing observations of the minority class. SMOTETomek combines SMOTE oversampling with Tomek links cleaning of the resampled data.
In simple terms, it generates new data points for the minority class based on the existing data and then removes ambiguous points near the class boundary.
SMOTETomek implementation in Python
Here,
- x is the set of independent features
- y is the dependent feature (target)
If you want to check the samples count before and after oversampling, run the below code.
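A similar sketch for SMOTETomek, again assuming the x and y arrays from earlier:

```python
from collections import Counter
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
x_smote, y_smote = smt.fit_resample(x, y)

# Class counts before and after resampling.
print("Before:", Counter(y))
print("After :", Counter(y_smote))
```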
Now let’s implement the same model, with the oversampled data.
Let’s check the accuracy of the model.
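A sketch of this step, reusing the train_and_evaluate helper defined earlier and the SMOTETomek output from above:

```python
# Retrain and evaluate on the oversampled (balanced) data.
model_oversampled = train_and_evaluate(x_smote, y_smote)
```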
We can see that we got very good accuracy on the balanced data, and both TP and TN increased, where
- TP: True Positive
- TN: True Negative
TP and TN are components of the confusion matrix.
Oversampling pros and cons
Below are the pros and cons of using the oversampling technique.
Pros:
- No information from the original majority class is lost.
- Works well when the overall amount of data is limited.
Cons:
- Duplicating or synthesizing minority samples increases the risk of overfitting.
- The dataset becomes larger, which increases training time.
UnderSampling
In undersampling, we decrease the number of samples in the majority class to match the number of samples of the minority class.
In brief, you sample from the majority class so that its size comes down to match the length of the minority class.
Let me explain in a much better way
For example, suppose we have a dataset with 100 samples labeled 0 and 900 samples labeled 1, so the minority class is 0. To move from a 9:1 ratio to a 1:1 ratio, we randomly select 100 of the 900 majority-class points, which results in:
- Label 1: 100 data points
- Label 0: 100 data points
Undersampling Implementation
We can implement it in two different ways:
- RandomUnderSampler method
- NearMiss method
Random undersampling Implementation
It simply samples the majority class at random until it reaches a similar number of observations as the minority classes.
Here,
- x is the set of independent features
- y is the dependent feature (target)
If you want to check the samples count before and after undersampling, run the below code.
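A sketch of the random undersampling step, assuming the same x and y arrays:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
x_under, y_under = rus.fit_resample(x, y)

# Class counts before and after undersampling.
print("Before:", Counter(y))
print("After :", Counter(y_under))
```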
NearMiss Implementation
NearMiss selects samples from the majority class whose average distance to the N closest minority-class samples is smallest.
Here,
- x is the set of independent features
- y is the dependent feature (target)
If you want to check the samples count before and after undersampling, run the below code.
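A sketch of the NearMiss step, again assuming x and y:

```python
from collections import Counter
from imblearn.under_sampling import NearMiss

nm = NearMiss()
x_near, y_near = nm.fit_resample(x, y)

# Class counts before and after undersampling.
print("Before:", Counter(y))
print("After :", Counter(y_near))
```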
Now we will implement the model using the undersampling data.
Now let’s check the accuracy of the model.
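As before, a sketch reusing the train_and_evaluate helper, here with the NearMiss output:

```python
# Retrain and evaluate on the undersampled (balanced) data.
model_undersampled = train_and_evaluate(x_near, y_near)
```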
Undersampling gives lower accuracy on smaller datasets because you are actually dropping information. Use this method only if you have a huge dataset.
Undersampling pros and cons
Below are the pros and cons of using the undersampling technique.
Pros:
- Reduces the size of the dataset, so training is faster and needs less memory.
- Well suited to very large datasets, where dropping majority samples costs little.
Cons:
- Potentially useful information from the majority class is discarded.
- The remaining sample may not be representative, which can hurt generalization.
When to use oversampling vs. undersampling
We now have a fair amount of knowledge about these two techniques; both address the imbalanced data issue, but they suit different situations.
- Oversampling: use oversampling when you have a limited amount of data.
- Undersampling: use undersampling when you have a huge amount of data and dropping majority-class samples will not affect the information in the data.
Advanced Resampling Techniques
Resampling techniques are essential tools for addressing the issue of imbalanced data in machine learning. These methods modify the original dataset to create a balanced representation of the majority and minority classes, which helps improve model performance.
In this section, we will discuss three advanced resampling techniques:
- Synthetic Minority Over-sampling Technique (SMOTE),
- Adaptive Synthetic (ADASYN),
- Tomek Links.
Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is an oversampling technique that generates synthetic samples for the minority class. The algorithm works by selecting a random instance from the minority class and finding its k-nearest neighbors.
A synthetic instance is then created by interpolating between the selected instance and one of its neighbors. This process is repeated until the desired number of synthetic samples is generated, leading to a balanced dataset. SMOTE helps improve the model's ability to generalize and reduces the risk of overfitting.
Adaptive Synthetic (ADASYN)
ADASYN is an extension of the SMOTE technique, designed to adaptively generate synthetic samples based on the local density of the minority class instances.
The main idea behind ADASYN is to generate more synthetic samples for instances that are harder to classify, thus emphasizing the learning of the model in more challenging regions. This method calculates the number of synthetic instances to be generated for each minority instance based on the ratio of its nearest neighbors that belong to the majority class.
By focusing on areas where the model struggles, ADASYN helps enhance the model's performance on the minority class.
Tomek Links
Tomek Links is an undersampling technique that aims to remove instances from the majority class that are close to the minority class instances. A pair of instances from different classes is considered a Tomek Link if there is no instance of either class closer to each other than these two instances.
By removing the majority class instances that form Tomek Links, the algorithm reduces the overlap between the classes and improves the decision boundary, leading to better classification performance.
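All three techniques are available in the imbalanced-learn library. The sketch below shows how they can be applied to the x and y arrays from the earlier sections; the default parameters are illustrative assumptions.

```python
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import TomekLinks

x_sm, y_sm = SMOTE(random_state=42).fit_resample(x, y)     # synthetic minority samples
x_ada, y_ada = ADASYN(random_state=42).fit_resample(x, y)  # density-aware synthetic samples
x_tl, y_tl = TomekLinks().fit_resample(x, y)               # drop majority points in Tomek links
```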
By employing these methods, we can create balanced datasets that enable machine learning models to perform better on both majority and minority classes, resulting in more accurate and reliable predictions.
Cost-Sensitive Learning for Imbalanced Data
Cost-sensitive learning is an alternative approach to addressing imbalanced data in machine learning models. Instead of modifying the dataset itself, cost-sensitive learning focuses on adjusting the learning process to account for the different costs associated with misclassification errors.
Incorporating Misclassification Costs
In cost-sensitive learning, we assign different costs to misclassifying instances from different classes. This way, the learning algorithm becomes aware of the imbalance and can prioritize minimizing the cost of misclassification errors, rather than simply maximizing the accuracy.
By assigning a higher cost to misclassifying minority class instances, the model is encouraged to pay more attention to them during the learning process.
Cost-Sensitive Algorithms
Several machine learning algorithms can be adapted to incorporate cost-sensitive learning. Some popular cost-sensitive algorithms include:
- Cost-Sensitive Decision Trees: Decision tree algorithms can be modified to use misclassification costs when choosing the best split at each node. This encourages the tree to prioritize splits that minimize the overall cost of misclassification.
- Cost-Sensitive Support Vector Machines (SVM): In SVM, the cost parameter C can be adjusted separately for each class, making the model more sensitive to instances from the minority class.
- Cost-Sensitive Logistic Regression: Logistic regression can be modified by incorporating class weights into the objective function, which penalizes the model more for misclassifying instances from the minority class.
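For instance, scikit-learn exposes class weights for several of these models; the sketch below uses logistic regression, and the 1:10 weighting is purely an illustrative assumption.

```python
from sklearn.linear_model import LogisticRegression

# Misclassifying the minority class (label 1) costs ten times more
# than misclassifying the majority class (an illustrative choice).
cost_sensitive_lr = LogisticRegression(class_weight={0: 1, 1: 10})

# Alternatively, let scikit-learn derive weights from the class frequencies.
balanced_lr = LogisticRegression(class_weight="balanced")

# Both models are then trained as usual, e.g. cost_sensitive_lr.fit(x_train, y_train)
```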
Advantages and Limitations of Cost-Sensitive Learning
Advantages:
- Cost-sensitive learning directly addresses the issue of imbalanced data by prioritizing the minority class during the learning process.
- Unlike resampling techniques, cost-sensitive learning does not modify the dataset, thus preserving the original distribution of the data.
- It is applicable to various machine learning algorithms and can be easily incorporated into existing models.
Limitations:
- Determining appropriate misclassification costs can be challenging and may require domain knowledge or trial-and-error experimentation.
- Cost-sensitive learning may not be suitable for all types of data or problems, particularly when the minority class instances are too few or too noisy.
By incorporating misclassification costs and utilizing cost-sensitive algorithms, machine learning models can achieve better performance on imbalanced datasets and produce more reliable predictions.
However, it is essential to carefully consider the advantages and limitations of cost-sensitive learning and choose the best approach based on the specific problem and dataset at hand.
Ensemble Methods for Imbalanced Data
Ensemble methods, which combine multiple base learners to produce a more accurate and robust model, can be adapted to handle imbalanced data effectively. In this section, we will discuss adapting bagging and boosting for imbalanced data, as well as the Balanced Random Forest algorithm.
Adapting Bagging for Imbalanced Data
Bagging, or bootstrap aggregating, involves training multiple base learners on different random subsets of the data and combining their predictions. To adapt bagging for imbalanced data, two common techniques are used: underbagging and overbagging.
- Underbagging: In underbagging, random subsets of the majority class are selected and combined with the complete minority class to form new balanced datasets. Base learners are then trained on these balanced datasets and combined using majority voting or averaging.
- Overbagging: Overbagging involves oversampling the minority class to balance the class distribution in the training subsets. This can be done by duplicating minority class instances or using synthetic samples generated by techniques like SMOTE.
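The imbalanced-learn library implements the underbagging idea directly through its BalancedBaggingClassifier. A minimal sketch, with hyperparameters left at their defaults and x, y taken from the earlier sections:

```python
from imblearn.ensemble import BalancedBaggingClassifier

# Each base learner is trained on a bootstrap sample in which the
# majority class is randomly undersampled to match the minority class.
bagging_model = BalancedBaggingClassifier(random_state=42)
bagging_model.fit(x, y)
```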
Adapting Boosting for Imbalanced Data
Boosting is an ensemble method that trains a series of base learners sequentially, where each learner tries to correct the errors made by its predecessor.
To adapt boosting for imbalanced data, adjustments can be made to the sample weights or the loss function to account for the class imbalance.
Balanced Random Forest Algorithm
The Balanced Random Forest algorithm is an extension of the Random Forest algorithm that handles imbalanced data by incorporating both underbagging and random feature selection.
This method creates balanced bootstrapped samples from the original data, and each tree in the ensemble is trained on one of these balanced samples.
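It is also available out of the box in imbalanced-learn; a minimal sketch with assumed hyperparameters:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree is grown on a bootstrap sample that is balanced by
# undersampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(x, y)
```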
Deep Learning Approaches for Imbalanced Data
Deep learning models, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), can also be adapted to handle imbalanced data effectively. In this section, we will discuss adapting these models and incorporating class weighting.
Convolutional Neural Networks (CNN) for Imbalanced Data
CNNs can be adapted for imbalanced data by applying data augmentation techniques to the minority class, which helps increase the number of training samples and balance the class distribution.
Additionally, class weighting can be applied to the loss function, which penalizes the model more for misclassifying instances from the minority class.
Recurrent Neural Networks (RNN) for Imbalanced Data
Similar to CNNs, RNNs can also benefit from data augmentation and class weighting to handle imbalanced data. Furthermore, RNNs can be combined with other techniques such as oversampling and ensemble methods to improve their performance on imbalanced datasets.
Class Weighting in Deep Learning Models
Class weighting can be applied to the loss function of deep learning models to account for imbalanced data. By assigning higher weights to the minority class instances, the model is encouraged to focus more on these instances during training, resulting in better performance on imbalanced datasets.
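As a rough sketch, class weights can be computed from the label frequencies and passed to a Keras model's fit() call; the labels, weights, and model below are purely illustrative assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical binary labels with a 9:1 imbalance.
y_train = np.array([0] * 900 + [1] * 100)

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # roughly {0: 0.56, 1: 5.0}

# With tf.keras, the weights are passed directly to fit(), for example:
# model.fit(x_train, y_train, epochs=5, class_weight=class_weight)
```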
By using these strategies, machine learning models can achieve better performance on imbalanced datasets and produce more reliable predictions.
Complete Code For Handling Imbalanced Data
The complete code for this article is available in our Github repo, where you can also fork it.
Conclusion
When handling imbalanced datasets, there is no one right solution to improve the accuracy of the prediction model. We need to try out multiple methods to figure out the best-suited sampling techniques for the dataset.
Depending on the characteristics of the imbalanced data set, the most effective techniques will vary. In most cases, synthetic techniques like SMOTE will outperform conventional oversampling and undersampling methods.
For better results, we can use synthetic sampling methods like SMOTE and advanced boosting and ensemble algorithms.
Hi Saimadhu, nice article, well written. I would however like to note that in most cases, unbalanced data is not a problem and balancing the data often does more harm than good. Even in your example, the unbalanced model is the best in predicting the spam cases, compared to the balanced alternatives. In my experience, building predictive models for many companies in numerous contexts, balancing the data is most often not needed and therefore not wise to do since you either lose information (undersampling) or add noise / overfitting (oversampling / SMOTE) to the data.
Hi Jurraiaan Nagelkerke,
Thanks for sharing your views. Yes, you are correct: the spam-or-ham example shown in the article already has a decent amount of accuracy compared to the model built after handling the imbalance. In industry, however, we face target distributions with ratios like 98:2, and the main intention of the article is to explain how we can use different imbalanced data techniques in such cases.
Really useful information on how to handle imbalance data. Thanks for posting 🙂
Hi,
Thanks for the compliment.
We wish you a very happy learning.