Top 3 Classification Machine Learning Metrics — Ditch Accuracy Once and For All

Most of the time, especially as beginners, we choose accuracy as the evaluation metric for classification problems. However, accuracy can be misleading, especially when the dataset is imbalanced. In this article, we will cover some evaluation metrics that are more reliable and can be used alongside accuracy. Specifically, we will cover the following topics.

· Confusion Matrix

· Why can Accuracy be Misleading?

· Precision

· Recall

· Fbeta-measure

Before going ahead with our discussion, let’s quickly discuss the confusion matrix.

Confusion Matrix

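A confusion matrix is a table that summarizes the performance of a model, typically a classification model. For binary classification, it usually looks like this:

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive (TP)     False Negative (FN)
Actual Negative      False Positive (FP)    True Negative (TN)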


As you can see, it contains the count of predicted values versus the actual ones. Let’s go ahead and describe each cell.

· True Positives: It shows the count where your model predicted samples as positive, and they were actually positive.

· True Negatives: It represents the count of correctly predicted negatives, i.e., when your model predicted the negative class, and the ground truth was also negative.

· False Positives: It shows the count where your model classified data points as positive, but they were actually negative. As the name suggests, they got falsely classified as positive.

· False Negatives: This value represents the total number of times your model predicted samples as negative, but their ground truth was positive.
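To make the four cells concrete, here is a minimal sketch that counts them from two lists of labels. The function name `confusion_counts` and the sample labels are made up for illustration; they are not from any particular library or dataset.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the ground truth is positive too
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the ground truth is positive
            counts["FN" if actual == positive else "TN"] += 1
    return counts["TP"], counts["TN"], counts["FP"], counts["FN"]

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```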

Let’s talk about accuracy now.

Why can Accuracy be Misleading?

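Accuracy can be defined as the fraction of correct predictions among all predictions. If we look at the confusion matrix, the correct predictions are the true positives and the true negatives:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)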



Suppose you have a dataset for predicting breast cancer. Let’s say this dataset is highly imbalanced: 90% of the samples have no breast cancer and 10% have breast cancer.

Now, we build a model that predicts “no cancer” every time. We know it is not a good model because it does not look at any of the data and only ever makes a single prediction. However, since 90% of the samples really are negative, it will have an accuracy of 90%, while detecting none of the actual cancer cases.


That’s how accuracy can be misleading.
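You can check this quickly with a small sketch. The dataset below is hypothetical: 1,000 patients with the same 90/10 split as in the example above, and a "model" that always predicts the negative class.

```python
# Hypothetical dataset: 1 = breast cancer, 0 = no breast cancer
y_true = [1] * 100 + [0] * 900   # 10% positive, 90% negative
y_pred = [0] * len(y_true)       # always predicts "no cancer"

# Accuracy: fraction of predictions that match the ground truth
correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual cancer cases did the model catch?
detected = sum(1 for actual, predicted in zip(y_true, y_pred)
               if actual == 1 and predicted == 1)

print(f"accuracy = {accuracy:.2f}")                            # accuracy = 0.90
print(f"cancer cases detected = {detected} of {sum(y_true)}")  # 0 of 100
```

The accuracy looks good, yet not a single cancer case is detected.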

Let’s now go ahead and discuss other evaluation metrics that are more reliable than accuracy.


Precision

Precision is defined as true positives over all positives predicted by the model. It can be calculated using the following formula:

Precision = True Positives / (True Positives + False Positives)



Let’s clarify it using an example. Suppose we have a model to predict whether a patient has breast cancer or not. If we use precision as our evaluation metric, it will tell us, out of all the breast cancer predictions, how many were actually correct. Consider the following confusion matrix.


What will be the precision in this case? The model predicted breast cancer for 103 patients in total, of which 83 actually have cancer (so 20 are false positives). Therefore,

Precision = 83 / 103 ≈ 0.81



It means that when our model predicts that a patient has cancer, it is right about 81% of the time.
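The same calculation as a minimal Python sketch. The counts come from the example above; the helper name `precision` and the choice to return 0.0 when there are no positive predictions are assumptions made for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP).

    Returns 0.0 when the model made no positive predictions
    (an assumed convention to avoid dividing by zero).
    """
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

tp = 83         # predicted cancer, actually cancer
fp = 103 - 83   # 103 cancer predictions in total, so 20 were wrong

print(round(precision(tp, fp), 3))  # 0.806
```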


Recall

Recall is defined as the ratio of true positives over actual positives. It is calculated using the following formula:

Recall = True Positives / (True Positives + False Negatives)



Let’s take the same example. Now, if we use recall as our evaluation metric, it will tell us, out of all breast cancer patients, how many were identified by the model. For the confusion matrix in the above section, the total number of actual breast cancer patients is 109, and our model detected 83 of them (so 26 are false negatives). Therefore,

Recall = 83 / 109 ≈ 0.76


So, about 24% of the patients who have cancer remain undetected by the model. Now, you can see how recall can be an important metric.
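Here is the recall calculation as a short sketch, again using the counts from the example. The helper name `recall` and its zero-division convention are assumptions for illustration.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN).

    Returns 0.0 when there are no actual positives
    (an assumed convention to avoid dividing by zero).
    """
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

tp = 83         # cancer patients the model detected
fn = 109 - 83   # 109 actual cancer patients, so 26 were missed

r = recall(tp, fn)
print(round(r, 3))      # 0.761
print(round(1 - r, 3))  # 0.239, the share of cancer patients left undetected
```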


Fbeta-measure

As we saw in the above sections, precision penalizes false positives. Recall, on the other hand, penalizes false negatives. Depending upon the problem at hand, you can choose one or the other metric. However, if you want to consider both, you can use the F-measure, also called the F-score or F1-score, which is the harmonic mean of precision and recall.

It is calculated using the following formula:

F-measure = (2 × Precision × Recall) / (Precision + Recall)


Now, let’s say you want to take both precision and recall into consideration but with more focus on recall. For this case, you can use the Fbeta-measure, which is a generalization of the F-measure that includes a β parameter to weight precision against recall:

Fbeta = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)


· If the value of β = 1, then the Fbeta-measure will be the same as the F-measure, i.e., equal weight on the precision and the recall.

· If the value of β is less than 1, generally 0.5, there is more weight assigned to the precision and less to recall.

· If the value of β is greater than 1, generally 2, then there is more focus on the recall and less on the precision.

Let’s use the above confusion matrix and calculate F0.5, F1, and F2 measures.
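One way to work through these numbers is the short sketch below. It takes the precision (83/103) and recall (83/109) from the earlier sections and applies the standard Fbeta definition; the function name `fbeta` is just an illustrative helper, not a library call.

```python
def fbeta(precision, recall, beta=1.0):
    """Fbeta = (1 + beta^2) * P * R / (beta^2 * P + R).

    Returns 0.0 when both precision and recall are zero
    (an assumed convention to avoid dividing by zero).
    """
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator

p = 83 / 103   # precision from the example
r = 83 / 109   # recall from the example

for beta in (0.5, 1, 2):
    print(f"F{beta} = {fbeta(p, r, beta):.3f}")
# F0.5 = 0.797
# F1 = 0.783
# F2 = 0.770
```

Since precision is higher than recall in this example, F0.5 (which favors precision) comes out highest and F2 (which favors recall) comes out lowest.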














As you have seen in this article, accuracy is not always a great choice to evaluate your classification models, especially if your dataset is imbalanced or your problem focuses on the false positives or negatives, such as the breast cancer detection problem. However, there is no best evaluation metric. We need to look at the problem at hand and the dataset before choosing a metric.

July 2, 2021
© 2021 Ernesto.  All rights reserved.