LIME vs. SHAP: Two of the Most Popular Explainers Compared
Today, training an accurate model is not enough for its successful deployment. You should also be able to interpret that model, i.e., explain its behavior, identify the features that contribute to a particular prediction, and determine which attributes are most significant overall.
A lot of work has been done on this important aspect of machine learning and data science, namely model interpretability. In this article, we will compare two popular Python libraries for model interpretability: LIME and SHAP.
Specifically, we will cover the following topics:
· Dataset Preparation and Model Training
· Model Interpretation with LIME
· Model Interpretation with SHAP
So, without further ado, let’s get started.
Note: The notebook for this tutorial can be found on my GitHub here.
Dataset Preparation and Model Training
For this article, we will use the Bank Loan Classification dataset from Kaggle. It contains the following attributes:
· Age
· Experience: professional experience in years
· Income: annual income
· ZIP Code
· Family: number of family members
· CCAvg: the customer’s average monthly spending on credit cards
· Education: 1 = undergraduate, 2 = graduate, 3 = professional
· Mortgage
· Securities Account
· CD Account: certificate of deposit account
· Online: whether the customer uses internet banking facilities
· CreditCard
Personal Loan is our target variable. Its value is 1 if the person accepted the loan offered in the last campaign; otherwise, it is 0.
Let’s go ahead and import the necessary modules. Then, we load the dataset into a pandas DataFrame.
Now, we drop the unnecessary columns, such as ID and ZIP Code. After that, we check whether the dataset contains any missing values.
As you can see, there are no null values.
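The preparation steps above can be sketched as follows. To keep the sketch self-contained, a small synthetic DataFrame stands in for the Kaggle CSV; in practice you would load the downloaded file with pd.read_csv (the filename in the comment is an assumption):

```python
import numpy as np
import pandas as pd

# In practice, load the downloaded Kaggle file, e.g.:
#   df = pd.read_csv("Bank_Personal_Loan_Modelling.csv")  # filename is an assumption
# Here, a small synthetic stand-in keeps the sketch runnable on its own.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "ID": np.arange(1, n + 1),
    "Age": rng.integers(23, 67, n),
    "Experience": rng.integers(0, 43, n),
    "Income": rng.integers(8, 225, n),
    "ZIP Code": rng.integers(90000, 96652, n),
    "Family": rng.integers(1, 5, n),
    "CCAvg": rng.uniform(0, 10, n).round(2),
    "Education": rng.integers(1, 4, n),
    "Mortgage": rng.integers(0, 635, n),
    "Personal Loan": rng.integers(0, 2, n),
    "Securities Account": rng.integers(0, 2, n),
    "CD Account": rng.integers(0, 2, n),
    "Online": rng.integers(0, 2, n),
    "CreditCard": rng.integers(0, 2, n),
})

# Drop columns with no predictive value.
df = df.drop(columns=["ID", "ZIP Code"])

# Check for missing values.
print(df.isnull().sum())
```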
The following code separates the target variable and the features. Moreover, we split the dataset into training and testing samples using the train_test_split() method imported from sklearn.model_selection.
Let’s now train a random forest classifier with the given data. We achieve an accuracy of 99%.
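A sketch of the split-and-train step (note that train_test_split lives in sklearn.model_selection). The synthetic stand-in below uses random values and a toy target, so the 99% accuracy figure applies only to the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # model_selection, not preprocessing

# Synthetic stand-in for the prepared bank-loan data (illustration only).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "Age": rng.integers(23, 67, n),
    "Experience": rng.integers(0, 43, n),
    "Income": rng.integers(8, 225, n),
    "Family": rng.integers(1, 5, n),
    "CCAvg": rng.uniform(0, 10, n).round(2),
    "Education": rng.integers(1, 4, n),
    "Mortgage": rng.integers(0, 635, n),
    "Securities Account": rng.integers(0, 2, n),
    "CD Account": rng.integers(0, 2, n),
    "Online": rng.integers(0, 2, n),
    "CreditCard": rng.integers(0, 2, n),
})
y = (X["Income"] > 120).astype(int)  # toy target standing in for "Personal Loan"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Accuracy: {acc:.2f}")
```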
We have now trained the model. So now, let’s go ahead and perform model interpretation with the LIME library.
Model Interpretation with LIME
LIME stands for Local Interpretable Model-Agnostic Explanations. As the name suggests, LIME is:
· Local: LIME performs local interpretation, i.e., it explains individual predictions of a model.
· Model-agnostic: It does not depend on the internals of any particular model. Therefore, it can be used with any system.
Let’s go ahead and install LIME using the following command.
pip install lime (prepend ! if you are running the command inside a Jupyter notebook)
First, we need to create a tabular explainer object using the lime_tabular.LimeTabularExplainer() method. It takes the training data, attributes names, mode (classification or regression), and class names.
Now, we will use the explain_instance() method of the explainer object to interpret a prediction. It takes an observation and the prediction function. Since we are doing classification, we pass the predict_proba method of the random forest object. If the mode were regression, we would instead pass the function that generates the actual predictions.
We use the show_in_notebook() method to visualize the results.
Here, we pass the first test sample. As you can see in the output, the model is 100% sure that the current observation belongs to class 0 (not accepting a personal loan). In the second column, we can see the features that contribute towards this prediction (class 0) and the ones that oppose it (class 1), along with their relative importance. The last column shows the attributes with their actual values.
By looking at the visualization, we can easily understand what’s going on: the prediction probability, which attributes contribute the most, and by how much. However, the output is not particularly pretty. Fortunately, we can extract the same information and build the visualization ourselves. Let’s see.
You can also get somewhat similar information using the as_pyplot_figure() method, which returns a matplotlib bar chart. Let’s see.
Features in green increase the likelihood that the current test sample is predicted as class 1. Attributes in red do the opposite.
Let’s now perform model interpretation with the SHAP library.
Model Interpretation with SHAP
SHAP stands for SHapley Additive exPlanations. It is based on Shapley values, which quantify how much each feature contributes to a prediction. SHAP is:
· Locally interpretable
· Globally interpretable: SHAP also explains the entire behavior of the model, for example, the effect of a feature on the target variable.
Let’s see how to interpret a single prediction in SHAP. First, we need to install it using the following command.
pip install shap (prepend ! if you are running the command inside a Jupyter notebook)
We use the TreeExplainer() method to create an explainer because our random forest classifier is a tree-based model. Then, we obtain the SHAP values for the first test sample. To explain the prediction, we use the force_plot() method, i.e.,
As you can see in the above output, the visualization looks great. However, we cannot really tell what’s going on.
The features in red increase the chances of the sample being classified as the current class. Since we passed the base value and the SHAP values for the first class (class 0), the current class is class 0. Attributes in blue push the prediction lower. Moreover, the size of a feature’s block shows its importance: the larger the block, the greater the contribution. The attributes are also ranked by their importance. In the output above, the combined effect of the red features outweighs that of the blue ones. Therefore, the outcome is 0 (loan not accepted).
As already mentioned, we can also interpret the model globally with SHAP. We can do that using the dependence_plot() method. It takes the attribute whose relationship with the target variable we want to examine, the SHAP values, and the dataset. It also finds the feature with which our given attribute interacts the most. Let’s see.
There is a negative relationship between income and the class-0 SHAP value, i.e., the higher the income, the lower the chances of the sample being classified as class 0 (rejecting the personal loan).
The advantages of SHAP do not end here. It also allows us to summarize the predictions over the entire dataset using the summary_plot() function. Let’s see.
A high income pushes the prediction lower. A medium-to-high value of the education variable also decreases the chances of the current class. Moreover, features such as Online and Securities Account have little to no effect on the prediction, as their SHAP values are near 0 (almost all of their points lie on or near the vertical line).
We saw that LIME’s explanation for a single prediction is more interpretable than SHAP’s. However, SHAP’s visualizations are better. SHAP also performs global interpretation, i.e., explanation for the entire model.
To conclude: for local interpretation, LIME is preferred over SHAP, while for global interpretation, SHAP is the better choice.