Six Popular Classification Evaluation Metrics In Machine Learning

August 6, 2020 Sharmila Polamuri

Evaluation metrics are among the most important topics in building machine learning and deep learning models. They tell us how well a model has been trained. Different families of machine learning algorithms call for different evaluation metrics.

To evaluate classification models we use classification evaluation metrics, whereas for regression models we use regression evaluation metrics.

Model evaluation metrics are available for both supervised and unsupervised learning techniques.

In this article, we will focus on the evaluation metrics used to quantify the performance of models built with supervised learning algorithms.

In particular, we will learn about different classification evaluation metrics.

Before we go further, let's list the topics we are going to cover in this article.

Table of Contents

1 Supervised and Unsupervised Learning Algorithms
1.1 Supervised Learning Algorithms
1.2 Unsupervised Learning
2 Why we need evaluation metrics
3 Classification Evaluation metrics
3.1 Accuracy
3.2 Confusion matrix
3.3 Precision
3.4 Recall
3.5 F1 Score
3.6 Log Loss

Supervised and Unsupervised Learning Algorithms

Supervised Learning Algorithms

Supervised learning algorithms apply only to datasets that have both features and labels. The expected target for a given combination of features is called the label.

For example, if we know the shape and color of a fruit, we can say with confidence what fruit it is. Here shape and color are the features and the fruit type is the target (label).

Supervised learning algorithms are further classified into two categories.

Classification Algorithms
Regression Algorithms

Classification Algorithms

If the target values are categorical, for example whether an input image contains a chair (label 1) or does not (label 0), we apply classification algorithms. This means we assign each input, based on its features, to one of the available classes or labels.

Below are the most popular classification algorithms

Regression Algorithms

If the target values are real numbers, such as the price of a house or an approved loan amount, we apply regression algorithms. Regression techniques predict a real value rather than a target class.

Below are the most popular regression algorithms.

Unsupervised Learning

Unsupervised learning techniques apply to datasets that don't have any target values for the features.

In this article we cover only the popular evaluation metrics used to quantify the performance of classification models.

Before that let’s learn why we need evaluation metrics.

Why we need evaluation metrics

For every machine learning or deep learning model, we need to know how well the model has learned from the training data. We also need to know how well the same model will predict future or unseen data.

For this we need a way to measure model performance. In machine learning, these performance measures are called evaluation metrics.

There are many evaluation metrics, and we need to know which metrics to use for which kind of problem.

For classification models, we use classification metrics. For regression models, we use regression metrics.

If we don't know which evaluation metrics to use, we end up comparing apples to oranges, which tells us nothing useful.

In this article, we will learn the different evaluation metrics used to test the performance of classification algorithms.

Let’s get started then.

Classification Evaluation metrics

Here we will discuss 6 different types of classification evaluation metrics.

Accuracy
Confusion matrix
Precision
Recall
F1 Score
Log Loss

Accuracy

Let's take an example dataset with binary classes, meaning the target takes only two values or labels, such as 0 and 1. We treat it as a sentiment analysis dataset with two labels: positive (label 1) and negative (label 0).

Our task is to build a model that predicts whether a given input sentence has positive (label 1) or negative (label 0) sentiment.

For example, our dataset has 200 sentences. Target class 0 has 100 sentences and class 1 has 100 sentences, so each target class has an equal number of sentences.

The first step is to divide the whole dataset into two parts: a training dataset and a test dataset.

Training Dataset: Used for model training.
Test Dataset: Used for validating the trained model.

Now each dataset has an equal number of sentences (100) and an equal number of each label: 50 sentences of class 0 and 50 sentences of class 1.

When we have an equal number of samples for both classes (or for all classes), we can use Accuracy, Precision, Recall, and F1 Score as evaluation metrics. Accuracy is the simplest of these: it is the fraction of samples that the model classifies correctly.

Accuracy measure python implementation
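
A minimal sketch of a hand-written accuracy function, assuming the true and predicted labels are two equal-length Python lists. The function name `accuracy` and the sample labels are illustrative and do not come from a trained model.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    # Count the positions where the predicted label equals the true label
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return correct / len(y_true)


# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

print(accuracy(y_true, y_pred))  # 5 of 8 predictions match -> 0.625
```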

Output

Below is the output for the above code.

Accuracy from scikit-learn library

We can also calculate the accuracy measure with scikit-learn. Using the below code we can achieve the same result.
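
A short sketch using `accuracy_score` from `sklearn.metrics`, assuming scikit-learn is installed. The label lists are illustrative sample data.

```python
from sklearn.metrics import accuracy_score

# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

# accuracy_score returns the fraction of correctly classified samples
score = accuracy_score(y_true, y_pred)
print(score)  # 5 of 8 predictions match -> 0.625
```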

Output

Confusion matrix

A confusion matrix is a tabular summary that gives a clear picture of how well the model predicts each individual target class.

It's easy to forget the individual components of the confusion matrix, so let's refresh the concept with the image below.

I suggest spending time on understanding the confusion matrix, since all the other classification metrics are built from its individual components.

Confusion matrix code with Sklearn
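
A small sketch using `confusion_matrix` from `sklearn.metrics`, assuming scikit-learn is installed and the labels are 0 and 1. For binary labels the returned matrix has actual classes as rows and predicted classes as columns, so it reads `[[TN, FP], [FN, TP]]`. The label lists are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For the binary case, ravel() unpacks the four cells in this order
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(tn, fp, fn, tp)  # 3 1 2 2
```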

Output

Below is the output for the above confusion matrix code.

Precision

The precision metric is useful for skewed or unbalanced datasets. A skewed dataset is one in which one class has far fewer samples than another.

For example, suppose the sentiment analysis dataset has 200 sentences, of which 20 belong to label 0 and the remaining 180 belong to label 1.

In this case, accuracy does not work well as an evaluation metric. A model that always predicts label 1 would score 90% accuracy (180 of 200) without learning anything, so high accuracy does not guarantee good performance on unseen data.

Before discussing precision further, we need to know a few terms. Remember that in our dataset the positive class has label 1 and the negative class has label 0.

The terms below come from the confusion matrix.

True Positive (TP) :- The number of positive samples that the trained model correctly predicts as positive. The actual target value is 1 and the model also predicts 1.
True Negative (TN) :- The number of negative samples that the trained model correctly predicts as negative. The actual target value is 0 and the model also predicts 0.
False Positive (FP) :- The number of negative samples that the trained model incorrectly predicts as positive. The actual target value is 0 but the model predicts 1.
False Negative (FN) :- The number of positive samples that the trained model incorrectly predicts as negative. The actual target value is 1 but the model predicts 0.

Let's see Python implementations of all these terms.

Confusion matrix individual components with python
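
A minimal sketch of the four components as separate functions, assuming binary labels where 1 is the positive class and 0 the negative class. The function names and sample labels are illustrative.

```python
def true_positive(y_true, y_pred):
    """Count samples where the actual label is 1 and the prediction is 1."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)


def true_negative(y_true, y_pred):
    """Count samples where the actual label is 0 and the prediction is 0."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)


def false_positive(y_true, y_pred):
    """Count samples where the actual label is 0 but the prediction is 1."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)


def false_negative(y_true, y_pred):
    """Count samples where the actual label is 1 but the prediction is 0."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)


# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

print(true_positive(y_true, y_pred))   # 2
print(true_negative(y_true, y_pred))   # 3
print(false_positive(y_true, y_pred))  # 1
print(false_negative(y_true, y_pred))  # 2
```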

Now let's compute all the above terms in a single function.
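
One way to sketch this is a single pass over the label pairs that returns all four counts together. The name `confusion_counts` is hypothetical, and the function assumes the labels are strictly 0 or 1.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = tn = fp = fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
        elif yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        else:
            # Only remaining case for binary labels: actual 1, predicted 0
            fn += 1
    return tp, tn, fp, fn


# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 2 3 1 2
```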

Output

Below is the output for the above code.

Using the individual components of the confusion matrix, we can also calculate accuracy. Below is the formula.

Accuracy Score = (TP+TN)/(TP+TN+FP+FN)

Now we will create a function that computes each of these confusion matrix components and uses them to calculate accuracy.
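
A minimal sketch of such a function, assuming binary labels 0 and 1. The name `accuracy_from_counts` is illustrative. It applies the formula (TP+TN)/(TP+TN+FP+FN) directly.

```python
def accuracy_from_counts(y_true, y_pred):
    """Compute accuracy as (TP + TN) / (TP + TN + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    # Summing booleans counts how many pairs satisfy each condition
    tp = sum(yt == 1 and yp == 1 for yt, yp in pairs)
    tn = sum(yt == 0 and yp == 0 for yt, yp in pairs)
    fp = sum(yt == 0 and yp == 1 for yt, yp in pairs)
    fn = sum(yt == 1 and yp == 0 for yt, yp in pairs)
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative labels: 1 = positive sentiment, 0 = negative sentiment
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

print(accuracy_from_counts(y_true, y_pred))  # (2 + 3) / 8 -> 0.625
```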

Output

The output of the above code matches the accuracy score calculated before.

Now let's go back to precision. To calculate precision we use the formula below.

Precision = TP / (TP+FP)

Let me explain about precision score with an example.

Consider a skewed test set of 100 samples, where 80 samples belong to label 0 and 20 samples belong to label 1. Suppose our trained model correctly predicts 60 of the 80 negative samples as negative and correctly predicts 10 of the 20 positive samples as positive.

If we calculate accuracy for this model, we get (10 + 60) / 100 = 0.7 (70% accuracy).

But the model misclassified 10 of the 20 positive samples as negative, and 20 of the 80 negative samples as positive.

From these samples :-

TP = 10
FP = 20
TN = 60
FN = 10

Precision value = 10/(10+20) = 0.33

This means that when the model predicts a sample as positive, it is correct only 33% of the time.

Precision python code implementation
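
A minimal sketch of a precision function, assuming binary labels with 1 as the positive class. The label lists are constructed to reproduce the counts from the example above (TP = 10, FN = 10, TN = 60, FP = 20). Returning 0.0 when the model predicts no positives at all is a convention chosen here for illustration.

```python
def precision(y_true, y_pred):
    """Compute precision as TP / (TP + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(yt == 1 and yp == 1 for yt, yp in pairs)
    fp = sum(yt == 0 and yp == 1 for yt, yp in pairs)
    if tp + fp == 0:
        # No positive predictions: precision is undefined, return 0.0 here
        return 0.0
    return tp / (tp + fp)


# 20 positive samples: 10 predicted correctly, 10 missed
# 80 negative samples: 60 predicted correctly, 20 flagged as positive
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 10 + [0] * 10 + [0] * 60 + [1] * 20

print(round(precision(y_true, y_pred), 2))  # 10 / (10 + 20) -> 0.33
```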

Output

Below is the precision score calculated using the above code.

Recall

Recall is another evaluation metric. It tells us what fraction of all actual positive samples the model correctly identifies, and therefore also how many positive samples it misclassifies.

Recall evaluation metric can be defined as below.

Recall = TP/(TP+FN)

From the precision example above, we can also calculate recall, using these values:

TP = 10
FN = 10

Recall = 10/(10+10)

The recall score is 0.50, which means our model identifies 50% of the positive samples correctly. Now we will write a function to calculate the recall value.
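
A minimal sketch of a recall function, assuming binary labels with 1 as the positive class. The label lists reproduce the counts of the running example (TP = 10, FN = 10). Returning 0.0 when there are no actual positives is a convention chosen here for illustration.

```python
def recall(y_true, y_pred):
    """Compute recall as TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(yt == 1 and yp == 1 for yt, yp in pairs)
    fn = sum(yt == 1 and yp == 0 for yt, yp in pairs)
    if tp + fn == 0:
        # No actual positives: recall is undefined, return 0.0 here
        return 0.0
    return tp / (tp + fn)


# 20 positive samples: 10 predicted correctly, 10 missed
# 80 negative samples: 60 predicted correctly, 20 flagged as positive
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 10 + [0] * 10 + [0] * 60 + [1] * 20

print(recall(y_true, y_pred))  # 10 / (10 + 10) -> 0.5
```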

Output

The calculated recall score, using the above code.

Well-trained machine learning models will have high recall and precision scores.

In real-world problems, a model usually outputs probabilities, so we have to choose a threshold value, convert the predicted probabilities into class predictions based on that threshold, and then compute precision and recall. Repeating this for several thresholds and plotting the resulting precision and recall values gives a curve called the precision-recall curve.

We will calculate precision and recall at each threshold by applying the functions above to the thresholded predictions.
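
A rough sketch of the thresholding step in plain Python. The probabilities, the threshold values and the helper name `precision_recall_at` are all illustrative assumptions. Each returned pair is one point on a precision-recall curve, which could then be plotted with any plotting library.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Threshold the probabilities, then return (precision, recall)."""
    # A sample is predicted positive when its probability reaches the threshold
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(yt == 1 and yp == 1 for yt, yp in pairs)
    fp = sum(yt == 0 and yp == 1 for yt, yp in pairs)
    fn = sum(yt == 1 and yp == 0 for yt, yp in pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec


# Illustrative true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.6, 0.65, 0.9, 0.2, 0.55]

for threshold in [0.3, 0.5, 0.7]:
    prec, rec = precision_recall_at(y_true, y_prob, threshold)
    print(threshold, round(prec, 2), round(rec, 2))
# 0.3 0.67 1.0
# 0.5 0.75 0.75
# 0.7 1.0 0.25
```

In this toy data, raising the threshold raises precision and lowers recall, which is the trade-off the curve visualizes.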

Output :

We can find the best threshold value for precision and recall from the precision-recall curve.

A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

In real-world scenarios it is difficult to choose a threshold that gives good values for both precision and recall. With a high threshold, we get fewer true positives and more false negatives. This decreases recall but typically increases precision. With a low threshold, the number of false positives rises, so precision drops.

Both precision and recall range from 0 to 1. Values close to 1 indicate a good model. Because of this threshold trade-off, it is useful to have another evaluation metric that combines precision and recall.

F1 score

The F1 score combines precision and recall into a single number. It is defined as the harmonic mean of precision and recall. If we denote recall by R and precision by P, the F1 score is

F1-score = 2PR / (P+R)

We can also define the F1 score in terms of true positives, false positives and false negatives:

F1-score = 2TP / (2TP+FP+FN)

Python code for F1-score metric
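
A minimal sketch of an F1 function built from precision and recall, assuming binary labels with 1 as the positive class. The name `f1` and the sample labels are illustrative, and returning 0.0 for undefined cases is a convention chosen here.

```python
def f1(y_true, y_pred):
    """Compute the F1 score as 2PR / (P + R)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(yt == 1 and yp == 1 for yt, yp in pairs)
    fp = sum(yt == 0 and yp == 1 for yt, yp in pairs)
    fn = sum(yt == 1 and yp == 0 for yt, yp in pairs)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0:
        # Both precision and recall are zero: define F1 as 0.0
        return 0.0
    return 2 * p * r / (p + r)


# Illustrative labels: TP = 2, FP = 1, FN = 2
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

print(round(f1(y_true, y_pred), 3))  # 2*2 / (2*2 + 1 + 2) = 4/7 -> 0.571
```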

Output

Below is the output for the calculated F1 score

Instead of looking at precision and recall individually, you can just look at the F1 score. Like precision, recall and accuracy, the F1 score ranges from 0 to 1. A perfect prediction model has an F1 score of 1. When dealing with datasets that have skewed targets, we should look at F1 (or precision and recall) instead of accuracy.

Log-Loss

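This is the last evaluation metric in this article for machine learning classification problems. We can define the log loss metric for a binary classification problem as below

Log loss = -1.0 * ( y_true * log(y_pred) + (1-y_true) * log(1- y_pred) )

Here y_true is the actual label (0 or 1), y_pred is the predicted probability that the sample belongs to class 1, and log is the natural logarithm. The formula gives the loss for a single sample; the log loss of a whole dataset is the average over all samples.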

Implementation of Log loss with python code
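
A minimal sketch of binary log loss, averaged over all samples. The function name `binary_log_loss` and the clipping constant `eps` are assumptions made for this illustration. Clipping keeps the probabilities away from exactly 0 and 1, where the logarithm is undefined.

```python
import math


def binary_log_loss(y_true, y_proba, eps=1e-15):
    """Average log loss for binary labels and predicted probabilities of class 1."""
    losses = []
    for yt, yp in zip(y_true, y_proba):
        # Clip the probability so that log(0) can never occur
        yp = min(max(yp, eps), 1 - eps)
        losses.append(-1.0 * (yt * math.log(yp) + (1 - yt) * math.log(1 - yp)))
    return sum(losses) / len(losses)


# Illustrative true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.8, 0.6]

print(round(binary_log_loss(y_true, y_proba), 4))
```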

Output

Output of the log loss after calling the log loss metrics function.

We can also implement a log loss evaluation metric by using sklearn.
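
A short sketch using `log_loss` from `sklearn.metrics`, assuming scikit-learn is installed. For a binary problem the function accepts a one-dimensional list of predicted probabilities for the positive class. The sample values are illustrative.

```python
from sklearn.metrics import log_loss

# Illustrative true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.8, 0.6]

# By default log_loss returns the average loss over all samples
score = log_loss(y_true, y_proba)
print(round(score, 4))
```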

Output

Below is the output for log loss metric with scikit learn

Log loss penalizes incorrect or far-off predictions quite heavily. In other words, unlike the other metrics, it punishes you for being very sure and very wrong.

For example,

If you predict a probability of 0.51 for class 1 and the sample belongs to class 1

log loss would be: - 1.0 * (1 * log(0.51) + (1 - 1) * log(1 - 0.51)) = 0.67

And if you predict a probability of 0.49 for class 1 and the sample belongs to class 0

log loss would be: - 1.0 * (0 * log(0.49) + (1 - 0) * log(1 - 0.49)) = 0.67

We can observe from the above example that both predictions receive the same log loss. Even though a cutoff of 0.5 would classify both samples correctly, each still incurs a loss of 0.67 because the model is barely confident in the right answer.

Complete Code

The complete code is placed below. You can also fork the code in our GitHub repo.

Conclusion

Every model, whether machine learning or deep learning, should be evaluated after training. Without evaluation, developers don't know how much a trained model has learned from historical data.

We can evaluate any trained machine learning classification model using any of the above metrics.

Besides measuring the performance of a trained model, we can also use these values to see exactly where to focus in order to improve the model.