
With hands-on examples.
Model evaluation is an essential step in the design cycle of a machine learning system. Most of the time, it is difficult to decide which model to use just by looking at the data and the problem at hand, so we train different models and compare them.
Moreover, evaluating models with accuracy alone does not always give us reliable results, especially when the dataset is imbalanced or when false positives or false negatives matter more than overall error. That’s where ROC curves and AUC come in, and we will discuss them in this article. They ease the task of model selection by letting us visualize the performance of different models.
Specifically, we will cover the following topics:
· ROC Curve
· AUC
· Use ROC Curve and AUC to Evaluate Different Models in Python
· Conclusion
Note: The notebook for this tutorial can be found on my GitHub here.
ROC Curve
ROC stands for Receiver Operating Characteristic. The ROC curve evaluates a classification model at every possible decision threshold; in other words, it summarizes the confusion matrices produced at various thresholds.
It plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis.
The false positive rate is the percentage of negative samples classified as positive. For example, if we have a model for breast cancer detection, FPR tells us the percentage of people who did not have cancer but were misdiagnosed with it by the model.
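In terms of confusion-matrix counts, where FP is the number of false positives and TN the number of true negatives:

FPR = FP / (FP + TN)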

The true positive rate, on the other hand, is the proportion of actual positives that the model identifies correctly. For breast cancer detection, TPR tells us the percentage of patients with cancer who are correctly diagnosed by the model.
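Likewise, where TP is the number of true positives and FN the number of false negatives:

TPR = TP / (TP + FN)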

Let’s explain the ROC curve in more detail. Suppose we have a model that outputs the probability that a sample has breast cancer, and assume the current threshold is 0.5: if the probability is greater than 0.5, the model predicts that the patient has breast cancer; otherwise, the patient is healthy.
We can create a confusion matrix for the model’s results at this threshold. It will contain a certain number of true positives, false negatives, and so on. Changing the threshold changes these results. If we increase it, more samples are classified as healthy, which increases the false negatives; if we decrease it, more patients are flagged with breast cancer, which increases the false positives.
What the ROC curve does is plot the (FPR, TPR) pair obtained at each threshold, i.e., it summarizes all of that information in a single curve.
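To make this concrete, here is a minimal sketch with made-up labels and scores (purely illustrative) that computes one (FPR, TPR) point per threshold; the ROC curve simply repeats this computation over every possible threshold:

```python
import numpy as np

# Hypothetical true labels and model scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.55])

# Each threshold yields one confusion matrix, hence one (FPR, TPR) point
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={threshold}: TPR={tp / (tp + fn):.2f}, FPR={fp / (fp + tn):.2f}")
```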
AUC
AUC is simply the area under the ROC curve. It tells us how good the model is at distinguishing between classes, i.e., its ability to separate the positive class from the negative one.
The value of AUC ranges between 0 and 1. The higher the value, the better the separability, and vice versa.
If the value of AUC is 1, the model classifies every positive and negative sample correctly. For example, the model correctly identifies every sick patient as having breast cancer and every healthy patient as cancer-free.

An AUC value of 0, on the other hand, represents the opposite case, i.e., all the sick patients get identified as healthy, and all the healthy ones are predicted sick.

When the AUC score is 0.5, the model cannot distinguish between the positive and negative classes at all; its scores are no better than random guessing, which basically means the model is of no use. For example, given one sick patient and one healthy one, there is only a 50% chance that the model scores the sick patient higher.

An AUC value greater than 0.5 shows that the model has some measure of separability. For example, if the AUC is 0.8, there is an 80% chance that the model ranks a randomly chosen positive sample above a randomly chosen negative one.

Let’s now go ahead and evaluate different models using the AUC-ROC curve in Python.
Use ROC Curve and AUC to Evaluate Different Models in Python
For this article, we will use the Breast Cancer Detection dataset from Kaggle. You can download it from here.
Let’s import the necessary modules and load the dataset into a Pandas DataFrame.
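Here is a minimal sketch of this step; the filename data.csv is an assumption, so point it at wherever you saved the download:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

# Load the dataset ("data.csv" is a placeholder; use your own path)
df = pd.read_csv("data.csv")
print(df.head())
```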
This dataset contains 70,000 samples. For this article, we reduce it to 1,000 rows to decrease the training time.
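Something along these lines (the random seed is an arbitrary choice, kept fixed for reproducibility):

```python
# Keep a random subset of 1000 rows to speed up training
df = df.sample(n=1000, random_state=42).reset_index(drop=True)
```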
In this dataset, the age is given in days. To make the interpretation simple, we convert it to years.
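A one-liner suffices, assuming the column is named age:

```python
# Convert age from days to years (assumes the column is named "age")
df["age"] = (df["age"] / 365).astype(int)
```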

The next step is to split the data into training and testing samples using the train_test_split() function imported from sklearn.model_selection above. We reserve 20% of the data for testing.
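A sketch of the split; the label column name target is an assumption, so substitute the actual name used in your copy of the dataset:

```python
# Separate the features from the label (column name "target" is an assumption)
X = df.drop(columns=["target"])
y = df["target"]

# Reserve 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```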
Now, we train different models on the training data and store each model’s predicted probabilities for the positive class of the test data, as these will be needed later to compute the ROC and AUC. The models used are logistic regression, k-nearest neighbors, decision tree, random forest, and support vector machine.
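One plausible way to set this up is shown below; all hyperparameters are scikit-learn defaults except probability=True, which the SVM needs in order to expose predict_proba():

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

# Fit each model and store its predicted probability of the positive class
probs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs[name] = model.predict_proba(X_test)[:, 1]
```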

Let’s now compute the AUC score, TPR, and FPR using the roc_auc_score() and roc_curve() functions of the sklearn.metrics module.
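Continuing the sketch above:

```python
# ROC curve points and AUC score for every model
roc_data = {}
for name, y_prob in probs.items():
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_data[name] = (fpr, tpr, roc_auc_score(y_test, y_prob))
```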

Now that we have the required information, let’s use it to plot the ROC curves and display the corresponding AUC scores.
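A sketch of the plot; the figure size and styling are arbitrary choices:

```python
plt.figure(figsize=(8, 6))
for name, (fpr, tpr, auc) in roc_data.items():
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.2f})")

# The diagonal is the baseline: a model with no discriminative power
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Baseline (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves")
plt.legend()
plt.show()
```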
As you can see in the visualization above, all the models sit above the baseline. The SVM classifier achieves the highest AUC score of 0.82, making it the best-performing model here.
Conclusion
In this article, we discussed what the AUC-ROC curve is and how it allows us to perform model evaluation and selection. The AUC-ROC curves are pretty simple to comprehend. Using them, we can easily compare different models, as we saw in the above section.