Evaluating classification models simplified.

Joan Ngugi
8 min read · Jun 10, 2022

If you cannot measure it, you cannot improve it. ~ Lord Kelvin

Any project can be judged a success or a failure. The machine learning world is no different: we need a way to know when to stop and call our model a success.

Introduction


Terminologies

Positive Class: Suppose a group of patients is tested for COVID-19 and the doctor reports that 70% of the tests came out positive. What does this mean? It simply means those patients have been confirmed to have COVID-19.

Negative Class: For the example above, if the doctor said the test result was negative for the remaining 30% of patients, this means those patients do not have COVID-19.

True Positive(TP): the case was positive, and it was predicted as positive. You are predicting a customer will default on a bank loan and they actually defaulted.

True Negative(TN): the case was negative, and it was predicted as negative. You are predicting a customer will not default on a bank loan and they did not default.

False Negative(FN): the case was positive but was predicted as negative. You are predicting the customer will not default on a bank loan and they actually defaulted.

False Positive(FP): the case was negative but was predicted as positive. You are predicting a customer will default on a bank loan, but they did not default.

True Positives and True Negatives are always good. False Positives, also known as Type 1 errors, are false alarms. False Negatives, also known as Type 2 errors, are misses: positive cases the model fails to catch. Your objective is therefore to decide whether it is more important to reduce the Type 1 error or the Type 2 error, and that decision is answered entirely by your business problem. We will see examples later in this article.

Confusion Matrix

A confusion matrix is a table that describes the performance of a classification model. It displays results of True Positives, True Negatives, False Positives, and False Negatives. Therefore you can simply say that it is a table that shows how good our model is at predicting examples of various classes.

Assuming you were predicting diabetic patients, then a confusion matrix for the count of the various classes could be as below depending on how your model performs:

Note:

0: Negative Class(Not Diabetic)

1: Positive Class(Diabetic)

Confusion matrix (rows = actual class, columns = predicted class; Source: Author):

                          Predicted 0     Predicted 1
Actual 0 (Not Diabetic)      123 (TN)        19 (FP)
Actual 1 (Diabetic)           40 (FN)        49 (TP)

True Positives[1,1]- We correctly predicted that 49 people have diabetes.

True Negatives[0,0]- We correctly predicted that 123 people do not have diabetes.

False Positives[0,1]- We predicted that 19 people have diabetes but they don’t have diabetes.

False Negatives[1,0]- We predicted that 40 people don’t have diabetes but they actually have diabetes.

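To make the [row, column] indexing concrete, here is a minimal sketch of the same counts in NumPy, where the first index is the actual class and the second is the predicted class (the same convention scikit-learn uses):

```python
import numpy as np

# Diabetes example: rows = actual class, columns = predicted class
cm = np.array([[123, 19],   # actual 0 (not diabetic): TN, FP
               [ 40, 49]])  # actual 1 (diabetic):     FN, TP

print(cm[1, 1])  # True Positives:   49
print(cm[0, 0])  # True Negatives:  123
print(cm[0, 1])  # False Positives:  19
print(cm[1, 0])  # False Negatives:  40
```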

The confusion matrix helps us compute evaluation metrics for classification models. These metrics will help us get a score and measure how well a model is performing and then if need be, tune the model up to a point where we have a good score. These metrics are Accuracy, Precision, Recall, and F1-Score.

Accuracy

Accuracy is the ratio of the samples you predicted correctly to the total number of samples.

From our classes in the figure below, we want to count how many we got right in the sick [1] class and how many we got right in the not sick [0] class. These are the True Positives (we predicted sick and they are actually sick) and the True Negatives (we predicted not sick and they are indeed not sick).

[Figure: confusion matrix for the accuracy example. Source: Author]

Accuracy is calculated from the confusion matrix as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

From our example figure above, accuracy is therefore 10/18 or 55.56%, where 10 is the number of True Positives plus True Negatives and 18 is the total number of observations.
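As a quick sketch in code, plugging the numbers from the example above into the formula:

```python
# Numbers from the worked example above: 10 correct predictions (TP + TN) out of 18
tp_plus_tn = 10
total = 18

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = tp_plus_tn / total
print(round(accuracy, 4))  # 0.5556
```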

Disclaimer: If you have a skewed dataset then Accuracy is not the best metric so use. This is because it is possible to have a model that gives incorrect predictions while still having high accuracy. For instance, if 95% of your dataset belongs to one class, then your model will learn very well to predict just that one class and hence achieve a high accuracy score. With a highly imbalanced dataset, you can almost guess without the help of a machine learning model which class a sample belongs to. We, therefore, resolve to other metrics to help with evaluation. These are the Precision, Recall and F1-score.

Precision

Precision answers the question: out of all the samples the model predicted as positive (Class Label 1), how many are actually positive? Example: out of all the people we predicted to be sick (both TP and FP), how many are actually sick (TP)?

The easiest way to remember precision is that we focus only on Class Label 1, the positive predictions.

In the figure below, the True Positives and False Positives both sit on the predicted sick [1] side. For precision, we ignore the not sick side [0] and focus on the sick side [1], which contains both TP and FP.

[Figure: confusion matrix for the precision example. Source: Author]

Precision is calculated from the confusion matrix as follows:

Precision = TP / (TP + FP)

From the figure above, our precision is therefore 8/(8+2) = 8/10 = 0.8, where 8 is the number of people we correctly predicted to be sick (TP) and 10 is everyone we predicted to be sick, whether they actually are sick (TP) or not (FP).

0.8 as a percentage is 80%. This means that 80% of the people we predicted to be sick are actually sick. As a ratio, 2 of every 10 people labeled sick are healthy, and 8 are actually sick.

High precision means that few of our positive predictions are false positives. Low precision means that a high share of our positive predictions are false positives.
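Plugging the numbers from the worked example above into the formula:

```python
# Numbers from the worked example above
tp, fp = 8, 2

# Precision = TP / (TP + FP)
precision = tp / (tp + fp)
print(precision)  # 0.8
```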

Recall

For Recall, we look at both sides of the matrix, but we focus on the people who are actually sick, whichever side they fall on (TP and FN). We include observations from both the predicted sick side [1] and the predicted not sick side [0].

Remember that TP and FN both mean the patient is actually sick. In the figure below, the True Positives sit on the sick side [1] and the False Negatives on the not sick side [0].

[Figure: confusion matrix for the recall example. Source: Author]

Recall is calculated from the confusion matrix as follows:

Recall = TP / (TP + FN)

From the recall image above:

How many sick people did we correctly predict as sick (TP)? 8

How many people are actually sick (TP and FN)? 8 + 3 = 11

Therefore, recall is 8/(8+3) or 8/11 = 0.73, where 8 is the number of people we correctly predicted as sick and 11 is the total number of people who are actually sick, whether we caught them (TP) or missed them (FN).

0.73 as a percentage is 73%. This means that 73% of the sick people were correctly predicted.
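Again, plugging the numbers from the worked example above into the formula:

```python
# Numbers from the worked example above
tp, fn = 8, 3

# Recall = TP / (TP + FN)
recall = tp / (tp + fn)
print(round(recall, 2))  # 0.73
```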

A low recall score indicates a high number of False Negatives. For instance, if we had predicted more False Negatives (people predicted as not sick who are actually sick), say only 2 True Positives against 6 False Negatives, recall would drop to 2/(2+6) = 0.25, or 25%.


F1-Score

Increasing precision often lowers recall, and vice versa. A single measure, the F1-Score, was therefore introduced. It combines Precision and Recall into one score, their harmonic mean, that conveys the balance between the two: F1 = 2 × (Precision × Recall) / (Precision + Recall). The best score is 1.0, whereas the worst score is 0.0.
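Using the precision (0.8) and recall (0.73) from the worked examples above, the harmonic mean works out as follows:

```python
# Precision and recall from the worked examples above
precision, recall = 0.8, 0.73

# F1 = 2 * (Precision * Recall) / (Precision + Recall)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.76
```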

Which metric should you focus on?

The answer to this is highly dependent on your business objective and it’s always good to consult a domain expert to guide you with this. Both Precision and Recall are important depending on the business problem at hand.

Before looking at Recall or Precision, identify whether FP or FN is more important to reduce.

If your objective is to reduce False Negatives, then you automatically know we are looking at Recall.

If your objective is to reduce False Positives, then you know we are looking at Precision.

If you want to strike a balance between Recall and Precision, then use the F1 score.

Example 1:

A model predicting diabetes in patients as positive (present) or negative (not present) is expected to detect diabetic patients so they can be treated.

  • Will the objective be to reduce False Positives or False Negatives? False Negatives are riskier, since many sick people might be left untreated and eventually develop severe health problems.

In this case, we want to reduce False Negatives, so we will essentially be looking at Recall (high Recall). The higher the recall, the fewer false negatives we have.

Example 2:

A model predicting whether an email is spam or not spam.

  • Will the objective be to reduce False Positives or False Negatives? False Positives are riskier here, since genuine, important emails could be sent to spam and you would miss out on important information. False Negatives are more acceptable in this case, since a spam email landing in the inbox does not cause any loss of important information. We therefore want to reduce False Positives and hence look at Precision.

Example 3:

Detecting whether a transaction is fraudulent or not.

  • Will the objective be to reduce False Positives or False Negatives? FP: you predicted a transaction is fraudulent but it is not. FN: you predicted a transaction is not fraudulent but it actually is.

In this case, we want to reduce False Negatives, so we will essentially be looking at Recall (high Recall). The higher the recall, the fewer false negatives we have.

Confusion Matrix

My confusion matrix is computed by comparing what I have predicted (y_pred) with the actual values (y_test).
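A minimal sketch of this step with scikit-learn; the tiny y_test and y_pred arrays here are purely illustrative, since in practice they come from your train/test split and fitted model:

```python
from sklearn.metrics import confusion_matrix

# y_test: actual labels, y_pred: model predictions (illustrative values only)
y_test = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[4 1]
#  [1 4]]
```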

Next, I am just beautifying the confusion matrix into a heatmap using seaborn.
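A minimal sketch of that visualization, assuming cm is the confusion matrix computed in the previous step (illustrative values shown here):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# cm is the confusion matrix from the previous step (illustrative values)
cm = np.array([[4, 1],
               [1, 4]])

# Annotated heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()
```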

Accuracy, Precision, Recall, and F1-score

We now get the accuracy score, precision, recall, and F1-score.
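A minimal sketch using scikit-learn's accuracy_score and classification_report, with the same illustrative labels as above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Same illustrative labels as in the confusion matrix snippet
y_test = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

# Overall accuracy
print(accuracy_score(y_test, y_pred))  # 0.8

# Per-class precision, recall, and F1-score, plus macro and weighted averages
print(classification_report(y_test, y_pred))
```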

There is a difference between the macro average, which treats every class equally, and the weighted average, which weights each class by its number of samples. I highly recommend this article to get an understanding of the difference between the two and which average to choose.

Thank you for reading. Link to Github Code
